# Preprocessing Code Example

### Part I: Setup
This part imports the libraries and loads the dataset. Feel free to change the dataset path to match your setup.

In [None]:
import pandas as pd
import os
import numpy as np
import sklearn.preprocessing
import seaborn as sns
print(os.getcwd())

In [None]:
# The raw file has no header row, so the column names are assigned manually.
df = pd.read_csv('adult.data', header=None)
df.columns = ['age', 'workclass', 'fnlwgt', 'edu', 'edu-num', 'marital', 'occupation',
              'relationship', 'race', 'sex', 'cap-gain', 'cap-loss', 'hpw',
              'native country', 'income']
#a = df.loc[df['workclass'] == ' Private']
#c = a['income']
#df['workclass']
#df3 = df.drop(['workclass', 'edu', 'marital', 'occupation', 'race', 'relationship', 'sex', 'native country'], axis=1)
#sns.pairplot(df, hue = 'income')

In [None]:
df.head()

In [None]:
index = df['income'].value_counts()
index_list = index.index      # The distinct income labels
print(index_list)

# For each income label, show how its rows are distributed across workclass.
for income in index_list:
    subset = df.loc[df['income'] == income]
    print(income)
    print(subset['workclass'].value_counts())

### Part II: Basic Data Understanding

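The following code loops over every column and prints the rows in which that column contains ' ?', the marker this dataset uses for missing values. Uncomment the value_counts() line to also list the distinct values in each column. This helps you understand every aspect of the dataset.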

In [None]:
for col in df.columns:
    print(col)
    #print(df[col].value_counts().index)
    # Show the rows where this column holds the ' ?' missing-value marker.
    print(df.loc[df[col] == ' ?'])
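
As a compact complement to printing the matching rows, the ' ?' markers can also be counted per column. The sketch below runs on a small made-up dataframe (toy and its values are hypothetical, not the Adult data) and assumes the same ' ?' marker with a leading space.

```python
import pandas as pd

# Hypothetical toy data that mimics the ' ?' missing-value marker.
toy = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': [' Private', ' ?', ' Private', ' ?'],
    'occupation': [' Sales', ' ?', ' Tech-support', ' Sales'],
})

# Flag every cell equal to the marker, then count the flags per column.
missing_counts = toy.isin([' ?']).sum()
print(missing_counts)
```

A per-column count like this makes it easier to decide whether to drop the affected rows or to treat ' ?' as its own category.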

### Part III: Preprocessing

The following lines of code show how to normalize the data. Note that scikit-learn scalers expect two-dimensional input, so to normalize a single feature you need to pass it as a one-column dataframe or a 2D array.

You can create a new dataframe with one feature by using old_df[['feature_name']].

More info at https://scikit-learn.org/stable/modules/preprocessing.html

In [None]:
df2 = df[['age']]

min_max_scaler = sklearn.preprocessing.MinMaxScaler()
min_max_scaler.fit_transform(df2)

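Alternatively, you can convert the Series (df['feature_name']) to a NumPy array with np.array() and reshape it into a single column with reshape(-1, 1).

In [None]:
# reshape(-1, 1) gives one column with one row per sample, which is the shape the scaler expects.
x_scaled2 = min_max_scaler.fit_transform(np.array(df['age']).reshape(-1, 1))
print(x_scaled2.shape)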
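
Scikit-learn scalers treat each column as a feature and each row as a sample, so the shape of the input matters. The following is a minimal sketch with made-up values (ages is hypothetical) that compares a column-shaped input with a row-shaped one.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

ages = np.array([20, 30, 40, 60])  # Hypothetical values

# One column, four samples: each value becomes (x - min) / (max - min).
as_column = MinMaxScaler().fit_transform(ages.reshape(-1, 1))
print(as_column.ravel())

# One sample, four columns: every column has min == max, so nothing is rescaled
# in a meaningful way and every output is 0.
as_row = MinMaxScaler().fit_transform(ages.reshape(1, -1))
print(as_row)
```

The row-shaped call runs without an error, which is why checking the output shape and values is worthwhile.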

To perform one-hot encoding for a categorical feature, you can use the following lines of code.

More info at https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html

In [None]:
df3 = df[['edu']]
df4 = pd.get_dummies(df3)
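
To make the result of one-hot encoding concrete, here is a small sketch on a made-up column (toy and its values are hypothetical). It assumes the default behaviour of pd.get_dummies, which creates one indicator column per distinct value and prefixes it with the original column name.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories.
toy = pd.DataFrame({'edu': [' Bachelors', ' HS-grad', ' Bachelors', ' Masters']})

encoded = pd.get_dummies(toy)

print(encoded.columns.tolist())  # One indicator column per category
print(encoded.astype(int))       # Cast to int so the indicators print as 0/1
```

Each row has exactly one indicator set, so the encoded frame carries the same information as the original column.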


In [None]:
#df4['income'] = df['income'].values
#df4.head()
#sns.pairplot(df4, hue= 'income')

You can also perform feature selection on the whole dataframe.

More info at https://scikit-learn.org/stable/modules/feature_selection.html

In [None]:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

x = df.iloc[:, :-1]    # Features: every column except the last
y = df.iloc[:, -1]     # Target: the last column (income)

In [None]:
# One-hot encode the dataframe. get_dummies encodes only the categorical (non-numeric) columns and leaves numeric columns unchanged.
x = pd.get_dummies(x)

# Create the feature selector. Here we score features with the chi-squared test (which requires non-negative values) and keep the 4 best (k=4).
selector = SelectKBest(chi2, k=4)     # Create the selector
selector.fit(x, y)                    # Score every feature against the target

# After fitting, the selector knows which features scored highest. Use their indices to build a new dataframe.
col = selector.get_support(indices=True)   # Indices of the selected features
x_new = x.iloc[:, col]

In [None]:
x_new.head()
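
To see why the selector keeps some features and drops others, it can help to inspect the chi-squared scores directly. The sketch below uses made-up data (X_toy, y_toy, and toy_selector are hypothetical names) in which one feature tracks the label and the other does not.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data: 'signal' differs sharply between classes, 'noise' does not.
X_toy = pd.DataFrame({
    'signal': [0, 0, 1, 9, 8, 9],
    'noise':  [5, 4, 5, 5, 4, 5],
})
y_toy = ['low', 'low', 'low', 'high', 'high', 'high']

toy_selector = SelectKBest(chi2, k=1).fit(X_toy, y_toy)

# scores_ holds one chi-squared statistic per feature; higher means more class-dependent.
scores = pd.Series(toy_selector.scores_, index=X_toy.columns)
print(scores.sort_values(ascending=False))

# get_support() without indices returns a boolean mask over the columns.
kept = X_toy.columns[toy_selector.get_support()]
print(list(kept))
```

Looking at the scores alongside the selected column names is a quick sanity check before training on the reduced dataframe.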

You can also perform feature extraction, here with principal component analysis (PCA).

More info at https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=4)             # Create the PCA transformer
x_pca = pca.fit_transform(x)          # Fit PCA to the dataset and project it onto 4 components

print(pca.explained_variance_ratio_)  # Fraction of the total variance explained by each component

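The array above shows the explained variance ratio of each component. The first element is 0.99511, which means that the first component alone explains 99.51% of the variance in the dataset. In other words, we could reduce the original dataset to a single feature and still retain 99.51% of its variance. Keep in mind that PCA is sensitive to feature scale: x was not standardized, so a feature with a large numeric range (such as fnlwgt) can dominate the first component.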

In [None]:
pd.DataFrame(x_pca).head()
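
Explained-variance ratios depend on the scale of the input features. The sketch below uses synthetic data (small, large, and X_toy are hypothetical) to compare PCA on raw inputs with PCA on standardized inputs. StandardScaler is one common choice for this step, not something the notebook above applies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: two independent features on very different scales.
small = rng.normal(0, 1, size=500)
large = rng.normal(0, 1000, size=500)
X_toy = np.column_stack([small, large])

# PCA on the raw data: the large-scale feature accounts for almost all variance.
raw_ratio = PCA(n_components=2).fit(X_toy).explained_variance_ratio_

# PCA after standardization: both features contribute on an equal footing.
X_scaled = StandardScaler().fit_transform(X_toy)
scaled_ratio = PCA(n_components=2).fit(X_scaled).explained_variance_ratio_

print(raw_ratio)
print(scaled_ratio)
```

If the first ratio drops sharply after standardization, the high value on the raw data reflected feature scale rather than shared structure.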