# 4. IMDB_Classification_Preprocess_and_Training<a id='2_Data_wrangling'></a>

## 4.1 Table of Contents<a id='2.1_Contents'></a>
* 4. IMDB_Classification_Preprocess_and_Training
  * 4.1 Table of Contents
  * 4.2 Introduction
  * 4.3 Imports
  * 4.4 Load Bank Churn Missing and Dropped Datasets
    * 4.4.1 Initial Loading and Assessments of Datasets
    * 4.4.2 Avoid Multicollinearity (Dummy Variable Trap)
  * 4.5 Create Train_Test Split & Preprocess Training Data
    * 4.5.1 Create Train_Test_Split
    * 4.5.2 Apply StandardScaler() for Training Data
  * 4.6 Save Data
  * 4.7 Summary


## 4.2 Introduction

Now that a thorough analysis, further cleaning, and one-hot encoding steps have been carried out in the prior EDA, train and test sets using a conventional 70/30 split will be created from both the missing and dropped bank churn datasets, for use in later steps that assess model performance.

## 4.3 Imports<a id='2.3_Imports'></a>

Import the packages needed to build the train and test sets for both the imputed and dropped bank churn datasets.

In [12]:
#Import pandas, matplotlib.pyplot, seaborn, and associated scikit learn methods and functions as well as random number for reproducibility
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import os
from PIL import Image
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
random_number = 42

## 4.4 Load Bank Churn Missing and Dropped Datasets

### 4.4.1 Initial Loading and Assessments of Datasets

In [2]:
#Load the filtered movie dataset (movie_df_filtered) from its CSV file.
path_file = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/movie_df_filtered.csv'
movie_df_filtered = pd.read_csv(path_file, index_col=0)

Auditing the datasets with .info() and .head() displaying the first few records.

In [3]:
#.info() on movie_df_filtered to see a summary of the data
movie_df_filtered.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3939 entries, 0 to 3971
Data columns (total 6 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   id       3939 non-null   object
 1   image    3939 non-null   object
 2   title    3939 non-null   object
 3   Genre_1  3939 non-null   object
 4   Genre_2  3939 non-null   object
 5   Genre_3  3939 non-null   object
dtypes: object(6)
memory usage: 215.4+ KB


In [4]:
#head method on movie_df_filtered to print the first several rows of the data
movie_df_filtered.head()

Unnamed: 0,id,image,title,Genre_1,Genre_2,Genre_3
0,tt0099785,https://m.media-amazon.com/images/M/MV5BMzFkM2...,Home Alone,Comedy,Family,
1,tt0100944,https://m.media-amazon.com/images/M/MV5BMjI1MD...,The Witches,Adventure,Comedy,Family
2,tt0099810,https://m.media-amazon.com/images/M/MV5BZDdkOD...,The Hunt for Red October,Action,Adventure,Thriller
3,tt0099088,https://m.media-amazon.com/images/M/MV5BYjhlMG...,Back to the Future Part III,Adventure,Comedy,Sci-Fi
4,tt0100419,https://m.media-amazon.com/images/M/MV5BYzk3NG...,Problem Child,Comedy,Family,


In [5]:
movie_df_filtered.shape

(3939, 6)

### 4.4.2 Avoid Multicollinearity (Dummy Variable Trap)

To avoid the dummy variable trap, that is, multicollinearity among the dummy variables of a feature, each one-hot encoded categorical variable will keep k-1 features (where k is the number of unique categories within the variable). The baseline for each variable will be its most frequently occurring category (e.g. Card_Category_Blue).

Further information and reference for the above paragraph found on: https://www.statology.org/dummy-variable-trap/
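
The k-1 encoding described above can be sketched as follows. This is a minimal illustration rather than the notebook's own code: the helper `encode_with_frequent_baseline` and the toy `Card_Category` values are assumptions made for the example, and the baseline is chosen as the most frequent category, as the paragraph describes.

```python
import pandas as pd

def encode_with_frequent_baseline(df, column):
    """One-hot encode `column`, then drop the dummy of its most frequent category."""
    dummies = pd.get_dummies(df[column], prefix=column)
    # mode() returns the most frequent value(s); take the first one as baseline
    baseline = f"{column}_{df[column].mode().iloc[0]}"
    return dummies.drop(columns=baseline), baseline

# Hypothetical toy data: "Blue" occurs most often, so it becomes the baseline
toy = pd.DataFrame({"Card_Category": ["Blue", "Blue", "Silver", "Gold", "Blue"]})
encoded, baseline = encode_with_frequent_baseline(toy, "Card_Category")

print(baseline)               # name of the dropped baseline column
print(list(encoded.columns))  # the remaining k-1 dummy columns
```

A row belonging to the baseline category is then represented by zeros in all remaining dummy columns, which is what removes the exact linear dependence between the dummies.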

In [80]:
movie_dummy_1 = movie_df_filtered[['id','Genre_1', 'Genre_2', 'Genre_3']]
movie_dummy = pd.get_dummies(movie_dummy_1, columns = ['Genre_1','Genre_2','Genre_3'])
movie_dummy.head()

Unnamed: 0,id,Genre_1_Action,Genre_1_Adventure,Genre_1_Animation,Genre_1_Biography,Genre_1_Comedy,Genre_1_Crime,Genre_1_Documentary,Genre_1_Drama,Genre_1_Family,...,Genre_3_ Musical,Genre_3_ Mystery,Genre_3_ News,Genre_3_ Romance,Genre_3_ Sci-Fi,Genre_3_ Sport,Genre_3_ Thriller,Genre_3_ War,Genre_3_ Western,Genre_3_None
0,tt0099785,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,tt0100944,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,tt0099810,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,tt0099088,0,1,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,tt0100419,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [35]:
shape = (500,500)
img = Image.open('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/Image_1/Image_1/tt3205010.jpg')
print(img.filename)
img_re = img.resize(shape)
img_re.show()
img_array = np.asarray(img_re)


C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/Image_1/Image_1/tt3205010.jpg


In [38]:
img_2 = Image.open('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/Image_1/Image_1/tt3210796.jpg')
img2_re = img_2.resize(shape)
img2_re.show()
img_2array = np.asarray(img2_re)
img_2array

array([[[122, 153, 158],
        [121, 153, 158],
        [123, 153, 159],
        ...,
        [101, 129, 144],
        [ 80, 109, 127],
        [168, 179, 186]],

       [[123, 151, 157],
        [122, 151, 157],
        [123, 151, 157],
        ...,
        [ 90, 119, 136],
        [ 70,  99, 118],
        [166, 177, 183]],

       [[121, 150, 156],
        [121, 150, 156],
        [123, 150, 157],
        ...,
        [ 89, 117, 136],
        [ 69,  97, 118],
        [165, 175, 183]],

       ...,

       [[ 35,  53,  67],
        [ 35,  53,  67],
        [ 35,  53,  67],
        ...,
        [ 35,  52,  66],
        [ 30,  48,  63],
        [156, 163, 169]],

       [[ 35,  53,  67],
        [ 35,  53,  67],
        [ 35,  53,  67],
        ...,
        [ 35,  52,  66],
        [ 30,  48,  63],
        [156, 163, 169]],

       [[ 35,  53,  67],
        [ 35,  53,  67],
        [ 35,  53,  67],
        ...,
        [ 35,  52,  66],
        [ 30,  48,  63],
        [156, 163, 169]]

In [86]:
def image_open(path):
    #Open the image, resize it to 500x500, and force three RGB channels
    shape = (500,500)
    img = Image.open(path)
    img_re = img.resize(shape)
    color = img_re.convert("RGB")
    #Flatten to one row per pixel (250000 rows), one column per colour channel
    img_array = np.asarray(color.getdata())
    image_df = pd.DataFrame(img_array)
    return image_df

In [87]:
image = image_open('C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/Image_1/Image_1/tt3205010.jpg')
image

Unnamed: 0,0,1,2
0,224,232,253
1,118,174,253
2,125,175,253
3,125,176,250
4,127,174,254
...,...,...,...
249995,254,219,153
249996,254,219,152
249997,253,218,148
249998,253,222,156


In [88]:
path = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/IMDB_Classification/0_Datasets/Image_1/Image_1'
image_data = []
movie_genre = []
for i in os.listdir(path):
    if i.endswith(".jpg"):
        image_data.append(image_open(f"{path}/{i}"))
        movie_genre.append(i)



In [89]:
image_data

[        0   1   2
 0       8  21  27
 1       8  21  27
 2       8  21  27
 3       8  21  27
 4       8  21  27
 ...    ..  ..  ..
 249995  6  22  32
 249996  7  23  32
 249997  7  23  33
 249998  8  24  33
 249999  8  24  33
 
 [250000 rows x 3 columns],
          0   1   2
 0       69  58  38
 1       72  61  41
 2       73  62  42
 3       73  62  42
 4       69  58  38
 ...     ..  ..  ..
 249995   0   0   0
 249996   0   0   0
 249997   0   0   0
 249998   0   0   0
 249999   0   0   0
 
 [250000 rows x 3 columns],
           0    1    2
 0       112  236  254
 1       117  236  254
 2       118  238  255
 3       120  238  254
 4       118  239  253
 ...     ...  ...  ...
 249995   54   11    1
 249996   64   10    1
 249997   55   18    3
 249998   63   19    2
 249999   73   17    2
 
 [250000 rows x 3 columns],
          0   1   2
 0       58  51  90
 1       58  52  91
 2       58  52  90
 3       57  52  90
 4       57  52  90
 ...     ..  ..  ..
 249995  10   5   3
 24999

In [105]:
#Strip the file extension from each image file name to recover the movie id
movie_gen = pd.Series(movie_genre)
movie_gen = movie_gen.str.replace('""', "")
gen = movie_gen.str.rsplit(".", n=1).str[0].tolist()

In [106]:
genr = pd.Series(gen)
genr

0       tt10009602
1       tt10034258
2       tt10055028
3       tt10075380
4       tt10075830
           ...    
1186     tt9844368
1187     tt9848626
1188     tt9876582
1189     tt9883996
1190     tt9896916
Length: 1191, dtype: object

In [None]:
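#Collect the id and first genre of every image id that has a match in movie_dummy_1
mov_dict = {'id': [], 'Genre_1': []}
for i in range(len(genr)):
    match = movie_dummy_1[movie_dummy_1['id'] == genr[i]]
    if not match.empty:
        mov_dict['id'].append(genr[i])
        mov_dict['Genre_1'].append(match['Genre_1'].iloc[0])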

In [108]:
movie_fil = movie_dummy_1[movie_dummy_1['id'].isin(genr)]
movie_fil['id']

136      tt7435570
200     tt10034258
460      tt6590506
540      tt3529404
735      tt3317652
           ...    
3965    tt29702487
3966    tt14399126
3967    tt21862696
3968    tt13932976
3969    tt27838714
Name: id, Length: 1171, dtype: object

In [67]:
#Drop the baseline dummy features from both the missing and dropped datasets to avoid dummy variable collinearity
features_to_drop = ['Gender_F', 'Education_Level_Graduate', 'Marital_Status_Married', 'Income_Category_Less than $40K', 'Card_Category_Blue']

bank_missing_df_dummy_dropped = bank_missing_df.drop(features_to_drop, axis=1)
bank_dropped_df_dummy_dropped = bank_dropped_df.drop(features_to_drop, axis=1)

In [68]:
#Assess new dataframe of dropped features bank_missing_df_dummy_dropped
bank_missing_df_dummy_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 8101 entries, 0 to 8100
Data columns (total 33 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Customer_Age                   8101 non-null   int64  
 1   Dependent_count                8101 non-null   int64  
 2   Months_on_book                 8101 non-null   int64  
 3   Total_Relationship_Count       8101 non-null   int64  
 4   Months_Inactive_12_mon         8101 non-null   int64  
 5   Contacts_Count_12_mon          8101 non-null   int64  
 6   Credit_Limit                   8101 non-null   float64
 7   Total_Revolving_Bal            8101 non-null   int64  
 8   Avg_Open_To_Buy                8101 non-null   float64
 9   Total_Amt_Chng_Q4_Q1           8101 non-null   float64
 10  Total_Trans_Amt                8101 non-null   int64  
 11  Total_Trans_Ct                 8101 non-null   int64  
 12  Total_Ct_Chng_Q4_Q1            8101 non-null   f

In [69]:
bank_missing_df_dummy_dropped.head()

Unnamed: 0,Customer_Age,Dependent_count,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,...,Marital_Status_Single,Marital_Status_missing,Income_Category_$120K +,Income_Category_$40K - $60K,Income_Category_$60K - $80K,Income_Category_$80K - $120K,Income_Category_missing,Card_Category_Gold,Card_Category_Platinum,Card_Category_Silver
0,54,1,36,1,3,3,3723.0,1728,1995.0,0.595,...,1,0,0,0,0,0,1,0,0,0
1,58,4,48,1,4,3,5396.0,1803,3593.0,0.493,...,0,0,0,0,0,0,1,0,0,0
2,45,4,36,6,1,3,15987.0,1648,14339.0,0.732,...,1,0,0,0,0,0,0,1,0,0
3,34,2,36,4,3,4,3625.0,2517,1108.0,1.158,...,1,0,0,0,0,0,0,0,0,0
4,49,2,39,5,3,4,2720.0,1926,794.0,0.602,...,0,0,0,1,0,0,0,0,0,0


In [70]:
bank_missing_df_dummy_dropped.shape

(8101, 33)

In [71]:
#Assess new dataframe of dropped features bank_dropped_df_dummy_dropped
bank_dropped_df_dummy_dropped.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5690 entries, 3 to 8100
Data columns (total 30 columns):
 #   Column                         Non-Null Count  Dtype  
---  ------                         --------------  -----  
 0   Customer_Age                   5690 non-null   int64  
 1   Dependent_count                5690 non-null   int64  
 2   Months_on_book                 5690 non-null   int64  
 3   Total_Relationship_Count       5690 non-null   int64  
 4   Months_Inactive_12_mon         5690 non-null   int64  
 5   Contacts_Count_12_mon          5690 non-null   int64  
 6   Credit_Limit                   5690 non-null   float64
 7   Total_Revolving_Bal            5690 non-null   int64  
 8   Avg_Open_To_Buy                5690 non-null   float64
 9   Total_Amt_Chng_Q4_Q1           5690 non-null   float64
 10  Total_Trans_Amt                5690 non-null   int64  
 11  Total_Trans_Ct                 5690 non-null   int64  
 12  Total_Ct_Chng_Q4_Q1            5690 non-null   f

In [72]:
bank_dropped_df_dummy_dropped.head()

Unnamed: 0,Customer_Age,Dependent_count,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,...,Education_Level_Uneducated,Marital_Status_Divorced,Marital_Status_Single,Income_Category_$120K +,Income_Category_$40K - $60K,Income_Category_$60K - $80K,Income_Category_$80K - $120K,Card_Category_Gold,Card_Category_Platinum,Card_Category_Silver
3,34,2,36,4,3,4,3625.0,2517,1108.0,1.158,...,0,0,1,0,0,0,0,0,0,0
4,49,2,39,5,3,4,2720.0,1926,794.0,0.602,...,0,0,0,0,1,0,0,0,0,0
5,60,0,45,5,2,4,1438.3,648,790.3,0.477,...,0,0,0,0,0,0,0,0,0,0
8,30,0,36,3,3,2,2550.0,1623,927.0,0.65,...,0,0,0,0,0,0,0,0,0,0
9,33,3,36,5,2,3,1457.0,0,1457.0,0.677,...,0,0,1,0,0,0,0,0,0,0


In [73]:
bank_dropped_df_dummy_dropped.shape

(5690, 30)

## 4.5 Create Train_Test Split & Preprocess Training Data

### 4.5.1 Create Train_Test_Split

For future modeling, a train_test_split is created for both the missing and dropped datasets. The test size is arbitrarily set to 30%, giving a 70/30 split.

In [74]:
#Creating train_test_split for both missing and dropped bank churn datasets.
#Test size arbitrarily set at 30% (a 70/30 split) for training and future model performance testing.

X_missing = bank_missing_df_dummy_dropped.drop(['Attrition_Flag'], axis=1)
y_missing = bank_missing_df_dummy_dropped['Attrition_Flag']
X_dropped = bank_dropped_df_dummy_dropped.drop(['Attrition_Flag'], axis=1)
y_dropped = bank_dropped_df_dummy_dropped['Attrition_Flag']


X_train_missing, X_test_missing, y_train_missing, y_test_missing = train_test_split(X_missing, y_missing, test_size=0.30, random_state = random_number) 
X_train_dropped, X_test_dropped, y_train_dropped, y_test_dropped = train_test_split(X_dropped, y_dropped, test_size=0.30, random_state= random_number)
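
The row counts produced by the split can be checked by hand. The sketch below assumes that scikit-learn rounds a fractional test size up to the next whole row and assigns the remaining rows to training; `split_sizes` is a hypothetical helper written for this check, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size=0.30):
    """Return (n_train, n_test) for a fractional test size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

print(split_sizes(8101))  # missing dataset -> (5670, 2431)
print(split_sizes(5690))  # dropped dataset -> (3983, 1707)
```

These values agree with the shapes printed in section 4.6.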

### 4.5.2 Apply StandardScaler() for Training Data

To put the features on a common scale for future modeling, each with mean 0 and unit variance, a StandardScaler object is fitted on the training data of each dataset. The same fitted scaler is then used to transform the corresponding X_test dataset, so that the test data is scaled consistently and does not influence the fitted statistics.

In [75]:
# Created function for fitting and transforming scaler object for X_train of both datasets and then transform X_test for both datasets
def scaling(train , test):
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(train)
    X_test_scaled = scaler.transform(test)
    
    return X_train_scaled, X_test_scaled

In [76]:
X_train_scaled_missing , X_test_scaled_missing = scaling(X_train_missing, X_test_missing)

In [77]:
X_train_scaled_dropped , X_test_scaled_dropped = scaling(X_train_dropped, X_test_dropped)
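
To make the fit-on-train, transform-on-test pattern concrete, the sketch below writes out the per-column arithmetic with the standard library only. It is an illustration under assumptions: `fit_scaler`, `apply_scaler`, and the toy age values are hypothetical, and the population standard deviation is used because that is what StandardScaler computes.

```python
from statistics import mean, pstdev

def fit_scaler(train_column):
    """Learn the mean and population standard deviation from training values only."""
    return mean(train_column), pstdev(train_column)

def apply_scaler(column, mu, sigma):
    """Standardise values with statistics that were learned elsewhere."""
    return [(x - mu) / sigma for x in column]

# Hypothetical toy column, split into train and test values
train_ages = [30, 40, 50, 60]
test_ages = [35, 65]

mu, sigma = fit_scaler(train_ages)            # statistics come from the training data
train_scaled = apply_scaler(train_ages, mu, sigma)
test_scaled = apply_scaler(test_ages, mu, sigma)

print(mean(train_scaled), pstdev(train_scaled))  # approximately 0 and 1
print(mean(test_scaled))                         # generally not 0
```

The training column ends up with mean 0 and unit variance by construction, while the test column is only shifted and rescaled by the training statistics, so its mean and variance will usually differ slightly.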

The scaled arrays produced for the training and test sets are reviewed below.

In [78]:
#Reviewing of X_train_scaled_missing and dropped as well as X_test_scaled_missing and dropped
X_train_scaled_missing

array([[-0.41349078,  1.28797953, -1.35293228, ..., -0.11017496,
        -0.04408862, -0.24090084],
       [ 1.45572316, -1.03949649,  0.0039595 , ..., -0.11017496,
        -0.04408862, -0.24090084],
       [ 0.33419479,  1.28797953, -0.85951709, ..., -0.11017496,
        -0.04408862, -0.24090084],
       ...,
       [ 0.08496627,  2.06380487,  0.6207285 , ..., -0.11017496,
        -0.04408862, -0.24090084],
       [ 0.08496627,  0.51215419,  0.8674361 , ..., -0.11017496,
        -0.04408862, -0.24090084],
       [ 1.58033743, -1.03949649,  2.10097409, ..., -0.11017496,
        -0.04408862, -0.24090084]])

In [79]:
X_test_scaled_missing

array([[ 1.33110890e+00, -2.63671151e-01,  3.95950466e-03, ...,
        -1.10174961e-01, -4.40886190e-02, -2.40900841e-01],
       [-4.13490784e-01,  1.28797953e+00, -2.42748093e-01, ...,
        -1.10174961e-01, -4.40886190e-02, -2.40900841e-01],
       [-1.90886194e+00, -1.03949649e+00, -1.72299368e+00, ...,
        -1.10174961e-01, -4.40886190e-02, -2.40900841e-01],
       ...,
       [-2.88876521e-01, -1.03949649e+00, -2.42748093e-01, ...,
        -1.10174961e-01, -4.40886190e-02,  4.15108555e+00],
       [ 2.09580531e-01,  5.12154188e-01, -7.36163289e-01, ...,
        -1.10174961e-01, -4.40886190e-02, -2.40900841e-01],
       [ 3.34194794e-01,  1.28797953e+00,  8.67436097e-01, ...,
        -1.10174961e-01, -4.40886190e-02, -2.40900841e-01]])

In [80]:
X_train_scaled_dropped

array([[-1.50889427,  0.52736863, -0.98537568, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [-0.63893519, -1.02914063, -0.48336093, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [ 0.35530376,  2.08387789, -0.10684986, ...,  9.5722467 ,
        -0.03884166, -0.23532724],
       ...,
       [ 1.47382257,  1.30562326,  2.15221653, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [ 1.22526283, -1.02914063,  1.02268333, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [-1.50889427, -0.250886  , -2.86793101, ..., -0.10446868,
        -0.03884166, -0.23532724]])

In [81]:
X_test_scaled_dropped

array([[-1.13605466,  1.30562326,  0.01865382, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [ 0.35530376,  1.30562326,  0.01865382, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [ 0.85242323,  0.52736863,  1.3991944 , ..., -0.10446868,
        -0.03884166, -0.23532724],
       ...,
       [-0.88749493, -0.250886  , -0.60886462, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [-0.26609559,  1.30562326, -0.73436831, ..., -0.10446868,
        -0.03884166, -0.23532724],
       [-1.26033453,  0.52736863,  0.01865382, ..., -0.10446868,
        -0.03884166, -0.23532724]])

Now that arrays have been created for X_train, X_test, y_train, and y_test for both the missing and dropped datasets, they will be used in the next steps to assess model performance.

## 4.6 Save Data

In [82]:
#Missing X_train and y_train set shape
print(f"X_train_scaled_missing shape: {X_train_scaled_missing.shape} and y_train_missing shape: {y_train_missing.shape}")

X_train_scaled_missing shape: (5670, 32) and y_train_missing shape: (5670,)


In [83]:
#Missing X_test and y_test set shape
print(f"X_test_scaled_missing shape: {X_test_scaled_missing.shape} and y_test_missing shape: {y_test_missing.shape}")

X_test_scaled_missing shape: (2431, 32) and y_test_missing shape: (2431,)


In [84]:
#Dropped X_train and y_train set shape
print(f"X_train_scaled_dropped shape: {X_train_scaled_dropped.shape} and y_train_dropped shape: {y_train_dropped.shape}")

X_train_scaled_dropped shape: (3983, 29) and y_train_dropped shape: (3983,)


In [85]:
#Dropped X_test and y_test set shape
print(f"X_test_scaled_dropped shape: {X_test_scaled_dropped.shape} and y_test_dropped shape: {y_test_dropped.shape}")

X_test_scaled_dropped shape: (1707, 29) and y_test_dropped shape: (1707,)


In [86]:
#Created dataframes for the new and respective X_train_scaled, X_test_scaled, y_train and y_test arrays
#Code for setting arrays as df and then saving to .csv found on https://stackoverflow.com/questions/6081008/dump-a-numpy-array-into-a-csv-file
df_X_train_scaled_missing = pd.DataFrame(X_train_scaled_missing)
df_y_train_missing = pd.DataFrame(y_train_missing)
df_X_test_scaled_missing = pd.DataFrame(X_test_scaled_missing)
df_y_test_missing = pd.DataFrame(y_test_missing)
df_X_train_scaled_dropped = pd.DataFrame(X_train_scaled_dropped)
df_y_train_dropped = pd.DataFrame(y_train_dropped)
df_X_test_scaled_dropped = pd.DataFrame(X_test_scaled_dropped)
df_y_test_dropped = pd.DataFrame(y_test_dropped)

In [87]:
# save the new datasets to new csv files for future model implementation
# index=False is used for every file so that all saved sets share the same column layout
datapath = 'C:/Users/tpooz/OneDrive/Desktop/Data_Science_BootCamp_2023/SpringBoard_Github/Bank-Churnrate/0_Datasets/Test_Train_Sets'
df_X_train_scaled_missing.to_csv(f'{datapath}/X_train_scaled_missing.csv', header=True, index=False)
df_y_train_missing.to_csv(f'{datapath}/y_train_missing.csv', header=True, index=False)
df_X_test_scaled_missing.to_csv(f'{datapath}/X_test_scaled_missing.csv', header=True, index=False)
df_y_test_missing.to_csv(f'{datapath}/y_test_missing.csv', header=True, index=False)
df_X_train_scaled_dropped.to_csv(f'{datapath}/X_train_scaled_dropped.csv', header=True, index=False)
df_y_train_dropped.to_csv(f'{datapath}/y_train_dropped.csv', header=True, index=False)
df_X_test_scaled_dropped.to_csv(f'{datapath}/X_test_scaled_dropped.csv', header=True, index=False)
df_y_test_dropped.to_csv(f'{datapath}/y_test_dropped.csv', header=True, index=False)

## 4.7 Summary

After applying scikit-learn's train_test_split and StandardScaler to create arbitrary 70/30 splits of both the missing and dropped bank churn datasets, the resulting sets will be used in the next steps on model performance. One note on the steps involved: the dummy variable trap, or multicollinearity between one-hot encoded feature variables, was avoided by keeping only k-1 feature values for each categorical variable. A function was also created to apply StandardScaler() to the two datasets, so that each dataset gets its own scaler fitted on its own training data; reusing a single scaler object would have overwritten the first fit when fit_transform was called on the second dataset.

One point to keep in mind, especially for the dropped dataset: because rows with missing values were already removed, the training set shrinks further to 3983 observations after the split. This reduction in observations might affect model performance.