
# Feature Engineering
You should build a machine learning pipeline with a data preprocessing and feature engineering step. In particular, you should do the following:
- Load the `adult` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). 
- Preprocess the dataset by 
    - removing missing values using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.dropna.html);
    - encoding categorical attributes using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html);
    - normalizing/scaling features using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html);
    - handling imbalanced classes using [Imbalanced-Learn](https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html);
    - and reducing the dimensionality of the dataset using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html).
- Train and test a support vector machine model using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).
- Evaluate the impact of the data preprocessing and feature engineering methods on the effectiveness and efficiency of the model.
- Check the documentation to identify the most important hyperparameters, attributes, and methods. Use them in practice.
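
The steps listed above can be chained into a single estimator. The following is a minimal sketch, not the notebook's actual solution: the miniature DataFrame, its values, and the choice of numeric and categorical columns are made up for illustration. Oversampling is left out because a sampler such as SMOTE needs `imblearn.pipeline.Pipeline` rather than Scikit-Learn's `Pipeline`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.svm import SVC

# Hypothetical miniature stand-in for the adult data (values are made up).
toy = pd.DataFrame({
    "age": [25, 38, 47, 52, 29, 61, 33, 44],
    "hours-per-week": [40, 50, 60, 45, 20, 38, 40, 55],
    "workclass": ["Private", "State-gov", "Private", "Self-emp",
                  "Private", "State-gov", "Private", "Self-emp"],
    "target": ["<=50K", "<=50K", ">50K", ">50K",
               "<=50K", ">50K", "<=50K", ">50K"],
})
x, y = toy.drop(columns="target"), toy["target"]

# Assumed column roles: scale the numeric columns, one-hot encode the rest.
numeric_cols = ["age", "hours-per-week"]
categorical_cols = ["workclass"]

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), numeric_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    ],
    sparse_threshold=0.0,  # always return a dense array so PCA can consume it
)

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=2)),
    ("svm", SVC()),
])
pipeline.fit(x, y)
predictions = pipeline.predict(x)
```

Fitting the whole pipeline on the training split only keeps the scaler, encoder, and PCA from seeing test data.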

In [None]:
import pandas as pd
import sklearn.model_selection
import sklearn.preprocessing
import sklearn.svm
import sklearn.decomposition
import imblearn.over_sampling
import sklearn.metrics


df = pd.read_csv("../../datasets/adult.csv")
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,target
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [None]:
df_train, df_test = sklearn.model_selection.train_test_split(df)
print("df_train:", df_train.shape)
print("df_test:", df_test.shape)


df_train: (24420, 15)
df_test: (8141, 15)
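
The split above relies on the defaults: a 25% test share (which matches the printed shapes) and no fixed seed, so the rows in each split change between runs. A small sketch of a reproducible, stratified split follows; the toy data, `test_size=0.2`, and `random_state=0` are illustrative choices, not values taken from the notebook.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with a 3:1 class ratio.
toy = pd.DataFrame({
    "feature": range(20),
    "target": ["<=50K"] * 15 + [">50K"] * 5,
})

# stratify keeps the class ratio the same in both splits;
# random_state makes the split repeatable.
train, test = train_test_split(
    toy, test_size=0.2, stratify=toy["target"], random_state=0
)
train_counts = train["target"].value_counts()
test_counts = test["target"].value_counts()
print(train_counts)
print(test_counts)
```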


In [None]:
# Data exploration
# - visualization
# - range
# - correlations

df_train["target"].value_counts()  # is the target variable balanced?

 <=50K    18521
 >50K      5899
Name: target, dtype: int64
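
To read the counts above as proportions, a short sketch is shown below. The two counts are copied from the output; the variable names are illustrative.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"<=50K": 18521, ">50K": 5899}, name="target")

proportions = counts / counts.sum()
# Accuracy that a model would reach by always predicting the majority class.
majority_baseline = proportions.max()
print(proportions.round(3))
```

Roughly three quarters of the training rows belong to the `<=50K` class, which is the imbalance that the oversampling step is meant to address.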

In [None]:
# remove missing values (they appear as " ?" in this dataset)
import numpy as np

# np.nan is the usual missing-value marker here; pd.NaT is meant for datetimes
df_train = df_train.replace(" ?", np.nan)
df_train_cleaned = df_train.dropna()
print("df_train_cleaned: ", df_train_cleaned.shape)

df_test = df_test.replace(" ?", np.nan)
df_test_cleaned = df_test.dropna()
print("df_test_cleaned: ", df_test_cleaned.shape)

df_train_cleaned:  (22648, 15)
df_test_cleaned:  (7514, 15)
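
According to the shapes above, dropping incomplete rows removes 1772 training rows (about 7%) and 627 test rows. Before dropping, it can help to see which columns carry the missing values. The sketch below uses a made-up four-row frame, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; " ?" marks a missing value as in the adult data.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "workclass": [" State-gov", " ?", " Private", " Private"],
    "occupation": [" Adm-clerical", " ?", " ?", " Handlers-cleaners"],
})

toy = toy.replace(" ?", np.nan)
missing_per_column = toy.isna().sum()       # missing cells per column
rows_lost = len(toy) - len(toy.dropna())    # rows removed by dropna()
print(missing_per_column)
print("rows lost:", rows_lost)
```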


In [None]:
x_train = df_train_cleaned.drop(["target"], axis=1)
y_train = df_train_cleaned["target"]
print("x_train: ", x_train.shape)
print("y_train: ", y_train.shape)

x_test = df_test_cleaned.drop(["target"], axis=1)
y_test = df_test_cleaned["target"]
print("x_test: ", x_test.shape)
print("y_test: ", y_test.shape)

x_train:  (22648, 14)
y_train:  (22648,)
x_test:  (7514, 14)
y_test:  (7514,)


In [None]:
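# Building the one-hot encoder model
# NOTE: the encoder is fit on every column of x_train, so numeric columns are
# one-hot encoded too (one output column per distinct value), which likely
# explains the very large feature count printed below.
enc = sklearn.preprocessing.OneHotEncoder(handle_unknown="ignore")
enc.fit(x_train)

# Encoding the attributes of the training data
# (.toarray() turns the sparse result into a dense array, which is memory-hungry at this width)
x_train_encoded = enc.transform(x_train).toarray()

# Encoding the attributes of the test data
x_test_encoded = enc.transform(x_test).toarray()

print("x_train: ", x_train_encoded.shape)
print("x_test: ", x_test_encoded.shape)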

x_train:  (22648, 16860)
x_test:  (7514, 16860)
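
The feature count above is large because `OneHotEncoder` creates one column per distinct value of every column it receives, numeric ones included. The following sketch compares encoding all columns with encoding only the non-numeric ones; the five-row frame and the `select_dtypes` selection rule are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows with two numeric columns and one categorical column.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "fnlwgt": [77516, 83311, 215646, 234721, 338409],
    "sex": ["Male", "Male", "Male", "Male", "Female"],
})

# Encoding everything: each distinct number becomes its own column.
enc_all = OneHotEncoder(handle_unknown="ignore").fit(toy)
n_all = sum(len(c) for c in enc_all.categories_)

# Encoding only the non-numeric columns.
categorical = toy.select_dtypes(exclude="number")
enc_cat = OneHotEncoder(handle_unknown="ignore").fit(categorical)
n_cat = sum(len(c) for c in enc_cat.categories_)

print("all columns:", n_all, "| categorical only:", n_cat)
```

With `handle_unknown="ignore"`, a numeric value in the test set that never occurred in training is encoded as all zeros, so one-hot encoded numeric columns also lose information at prediction time.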


In [None]:
# Standardization
# Building a standardization model
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(x_train_encoded)

# scaling the training features
x_train_standardized = scaler.transform(x_train_encoded)

# scaling the test features (using the statistics learned from the training set)
x_test_standardized = scaler.transform(x_test_encoded)

print("x_train_standardized: ", x_train_standardized.shape)
print("x_test_standardized: ", x_test_standardized.shape)

x_train_standardized:  (22648, 16860)
x_test_standardized:  (7514, 16860)
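
`StandardScaler` learns a mean and a standard deviation per feature from the training data and reuses them for the test data. A minimal numeric check of that behavior, using made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Tiny illustrative arrays with a single feature.
x_train = np.array([[1.0], [2.0], [3.0]])
x_test = np.array([[4.0]])

scaler = StandardScaler().fit(x_train)
x_test_scaled = scaler.transform(x_test)

# Same computation by hand: z = (x - train mean) / train standard deviation.
manual = (x_test - x_train.mean(axis=0)) / x_train.std(axis=0)

print("mean_:", scaler.mean_, "scale_:", scaler.scale_)
print(x_test_scaled, manual)
```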


In [None]:
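# Dimensionality reduction
# building a PCA model
# NOTE: PCA is fit on x_train_encoded, not x_train_standardized, so the
# standardized arrays from the previous cell are not used downstream.
# Pass x_train_standardized / x_test_standardized here to include that step.
pca = sklearn.decomposition.PCA(n_components=250)
pca.fit(x_train_encoded)

# reducing the number of training features
x_train_reduced = pca.transform(x_train_encoded)

# reducing the number of test features
x_test_reduced = pca.transform(x_test_encoded)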

print("x_train_reduced: ", x_train_reduced.shape)
print("x_test_reduced: ", x_test_reduced.shape)

x_train_reduced:  (22648, 250)
x_test_reduced:  (7514, 250)
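
The value `n_components=250` is a fixed choice. One way to judge such a choice is the `explained_variance_ratio_` attribute, and `PCA` also accepts a fraction of variance to retain instead of a component count. The sketch below uses synthetic data with three underlying factors; the 0.95 threshold is an illustrative assumption.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by 3 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
x = latent @ mixing + 0.01 * rng.normal(size=(200, 10))

# A float in (0, 1) asks PCA to keep enough components to exceed that variance share.
pca = PCA(n_components=0.95, svd_solver="full").fit(x)
cumulative = np.cumsum(pca.explained_variance_ratio_)

print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative.round(4))
```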


In [None]:
# Oversampling
sm = imblearn.over_sampling.SMOTE()
x_train_balanced, y_train_balanced = sm.fit_resample(x_train_reduced, y_train)
y_train_balanced.value_counts()

 <=50K    17000
 >50K     17000
Name: target, dtype: int64
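
SMOTE balances the classes by creating synthetic minority samples that lie between an existing minority point and one of its nearest minority neighbors. The function below is a simplified illustration of that interpolation idea only. It is not the `imblearn` implementation, and the function name, its parameters, and the sample points are hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(x_minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors: a point is normally its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(x_minority)
    _, neighbor_idx = nn.kneighbors(x_minority)
    # Pick a random base point and one of its k true neighbors (columns 1..k).
    base = rng.integers(0, len(x_minority), size=n_new)
    chosen = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
    # Move a random fraction of the way from the base point to the neighbor.
    gap = rng.random((n_new, 1))
    return x_minority[base] + gap * (x_minority[chosen] - x_minority[base])


x_minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
synthetic = smote_like_samples(x_minority, n_new=10)
print(synthetic.shape)
```

In the notebook itself, `SMOTE()` is created without a seed, so the synthetic samples differ between runs; passing `random_state` to `imblearn.over_sampling.SMOTE` would make them repeatable.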

In [None]:
# Training a model
model = sklearn.svm.SVC()
model.fit(x_train_balanced, y_train_balanced)

SVC()
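
`SVC()` above uses the defaults (`kernel="rbf"`, `C=1.0`, `gamma="scale"`). Since the task asks about efficiency as well as effectiveness, one option is to time the fit for different settings. The sketch below does this on a synthetic dataset from `make_classification`; the dataset size and the two kernels compared are arbitrary choices for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in data; not related to the adult dataset.
x, y = make_classification(n_samples=300, n_features=20, random_state=0)

timings = {}
for kernel in ("linear", "rbf"):
    model = SVC(kernel=kernel, C=1.0, gamma="scale")
    start = time.perf_counter()
    model.fit(x, y)
    timings[kernel] = time.perf_counter() - start
    # n_support_ holds the number of support vectors found for each class.
    print(kernel, "support vectors per class:", model.n_support_)

print(timings)
```

The same timing pattern can be wrapped around `model.fit` with and without PCA or oversampling to compare their effect on training time.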

In [None]:
# testing the model
y_predicted = model.predict(x_test_reduced)
accuracy = sklearn.metrics.accuracy_score(y_test, y_predicted)
accuracy

0.8316475911631621
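
Accuracy alone can be misleading when the classes are imbalanced: if the test split resembles the training split, always predicting the majority class would already score around 0.76. The sketch below shows metrics that expose per-class behavior, using hand-made labels rather than the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score, confusion_matrix

# Hypothetical labels: 6 majority and 4 minority examples,
# with half of the minority examples misclassified.
labels = ["<=50K", ">50K"]
y_true = ["<=50K"] * 6 + [">50K"] * 4
y_pred = ["<=50K"] * 8 + [">50K"] * 2

accuracy = accuracy_score(y_true, y_pred)
# Balanced accuracy is the mean of the per-class recalls.
balanced = balanced_accuracy_score(y_true, y_pred)
# Rows are true classes, columns are predicted classes, in the order of `labels`.
matrix = confusion_matrix(y_true, y_pred, labels=labels)

print(accuracy, balanced)
print(matrix)
```

Applied to `y_test` and `y_predicted`, the same calls would show whether the model's errors are concentrated in the minority class.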