# Preprocessing: Categorical and Numeric Features 

- The data in a column generally falls into one of three types:
    1. Numeric: continuous numbers, such as age or time
    2. Categorical: discrete values presented either as text (e.g. gender: female/male) or as numbers (e.g. level: 1, 2, 3)
    3. Text: words or sentences, such as a movie review
- Text data (item 3) is not covered here; the following discusses the preprocessing of numeric and categorical data

## Why is preprocessing needed?

### For numeric features:
- Preprocessing "standardises" the values of each column so that all columns end up on a comparable scale in terms of mean and variance
- This is referred to as scaling; it reduces the influence of extremely large or small values (within a column and across columns) on model training
    - There are many ways to scale numeric values, such as:
        - StandardScaler
        - RobustScaler
        - MinMaxScaler
        - Normalizer (unlike the others, it rescales each row rather than each column)
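
To make the idea of scaling concrete, here is a minimal sketch in plain Python of the arithmetic behind two of the scalers listed above: StandardScaler subtracts the column mean and divides by the (population) standard deviation, and MinMaxScaler maps the column onto the range 0 to 1 by default. The fare values and the helper names `standardise` and `min_max` are made up for illustration; they are not part of scikit-learn.

```python
import statistics

# Hypothetical fares, including one extreme value
fares = [7.25, 71.28, 7.93, 8.05, 512.33]


def standardise(values):
    """Return (x - mean) / std for each value, using the population std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max(values):
    """Return (x - min) / (max - min) for each value, giving the range 0 to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


scaled = standardise(fares)
ranged = min_max(fares)

print([round(v, 3) for v in scaled])
print([round(v, 3) for v in ranged])
```

After standardising, the column has a mean of 0 and a standard deviation of 1; after min-max scaling, the smallest value becomes 0 and the largest becomes 1.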

### For categorical features:
- A machine learning model cannot work with categorical data such as "female and male" or "level: 1, 2, 3" directly; it has to be "told" how the values map to numbers, say, 0 means female and 1 means male
- With one-hot encoding, a column of categorical data is expanded so that each distinct value becomes a column of its own before the data is fed into a machine learning model, for example:
    - the "gender" column is divided into "female" and "male" columns
    - each row holds 1 under the column that matches its value and 0 under the other column
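
The expansion described above can be sketched in a few lines of plain Python. The helper `one_hot` is a hypothetical name used only for this illustration; in the notebook itself the work is done by scikit-learn's OneHotEncoder.

```python
def one_hot(values):
    """Expand a list of category labels into one 0/1 column per distinct label."""
    categories = sorted(set(values))  # column titles, in a stable order
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


columns, encoded = one_hot(["male", "female", "female", "male"])

print(columns)
for row in encoded:
    print(row)
```

Each input row ends up with exactly one 1, placed under the column named after its original value.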

## ColumnTransformer
- combines the preprocessing steps for numeric and categorical columns into a single transformer
- use the .fit_transform() method on the training dataset (X_train)
- use the .transform() method on the test dataset (X_test), so that the test data is transformed with the parameters learned from the training data
- for details, please refer to https://towardsdatascience.com/what-and-why-behind-fit-transform-vs-transform-in-scikit-learn-78f915cf96fe
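
The difference between the two methods can be illustrated with a toy scaler written in plain Python. `TinyStandardScaler` is a hypothetical class that only mimics the fit/transform pattern; it is a sketch of the idea, not scikit-learn's implementation.

```python
import statistics


class TinyStandardScaler:
    """Toy scaler: learns mean and std in fit(), applies them in transform()."""

    def fit(self, values):
        # Learn the parameters from the data passed in (the training set)
        self.mean_ = statistics.fmean(values)
        self.std_ = statistics.pstdev(values)
        return self

    def transform(self, values):
        # Apply the previously learned parameters; nothing is re-learned here
        return [(v - self.mean_) / self.std_ for v in values]

    def fit_transform(self, values):
        return self.fit(values).transform(values)


train = [10.0, 20.0, 30.0]
test = [20.0, 40.0]

scaler = TinyStandardScaler()
train_scaled = scaler.fit_transform(train)  # learn from train, then scale train
test_scaled = scaler.transform(test)        # scale test with the train parameters

print(train_scaled)
print(test_scaled)
```

Because the test values are scaled with the training mean (20.0), the test value 20.0 maps to 0.0 even though it is not the mean of the test set. Calling fit_transform on the test set instead would let information from the test data leak into the preprocessing.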


*First, let's load the libraries and data*


### Libraries

In [7]:
import os
import pandas as pd

#for splitting a dataset into training and testing sets
from sklearn.model_selection import train_test_split

#for transforming categorical features
from sklearn.preprocessing import OneHotEncoder

#for transforming numeric features
from sklearn.preprocessing import StandardScaler

#combines the categorical and numeric preprocessing steps so they can be applied in one transformation
from sklearn.compose import ColumnTransformer



### Read in the data

In [8]:
pwd = os.getcwd()
data = os.path.join(pwd, "data.csv")
df = pd.read_csv(data)
df.head(3)

| Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 |  | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 |  | S |


### Feature and target selection

In [9]:
features = df[["Pclass", "Sex","Fare"]]
target = df[["Survived"]]

# Categorical Feature



In [10]:
df[["Pclass"]].value_counts(ascending=False)

Pclass
3         491
1         216
2         184
dtype: int64

In [11]:
df[["Sex"]].value_counts(ascending=False)

Sex   
male      577
female    314
dtype: int64

# Numeric Feature

In [12]:
#number of distinct fare values
fare_types = df[["Fare"]].value_counts().count()

display("There are {} distinct fares among {} passengers; let's treat Fare as numeric rather than categorical to demonstrate the preprocessing of numeric features".format(fare_types, df.shape[0]))

"There are 248 distinct fares among 891 passengers; let's treat Fare as numeric rather than categorical to demonstrate the preprocessing of numeric features"

# Splitting the Dataset
- This step splits the dataset into training and testing sets for machine learning:
    - training set, named X_train and y_train, for training a model
    - testing set, named X_test and y_test, for testing/evaluating the performance of a model
- More on this in: splitting.ipynb
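
As a rough illustration of what the split does, the sketch below shuffles row indices and holds out a quarter of them, which matches train_test_split's default of a 25% test set when no size is given. `simple_train_test_split` is a hypothetical helper written in plain Python; it does not reproduce the exact rows that scikit-learn selects for the same seed.

```python
import math
import random


def simple_train_test_split(rows, test_fraction=0.25, seed=42):
    """Shuffle the row indices and hold out a fraction of them for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = math.ceil(len(rows) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


rows = list(range(891))  # stand-in for the 891 passengers
train_rows, test_rows = simple_train_test_split(rows)

print(len(train_rows), len(test_rows))
```

With 891 rows this gives 668 training rows and 223 testing rows, and no row appears in both sets.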

In [13]:
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=42)

# Transformation of Categorical and Numeric Features
- This step combines the preprocessing steps for both categorical and numeric data

In [14]:
#ColumnTransformer takes a list of tuples: ("name", preprocessing transformer, columns to be transformed)

ct = ColumnTransformer([
    #sparse_output=False returns a dense array (scikit-learn >= 1.2; older versions use sparse=False)
    ("onehot", OneHotEncoder(sparse_output=False), ["Pclass", "Sex"]),
    ("scaling", StandardScaler(), ["Fare"])
    ])

In [34]:
#.fit_transform chains .fit and .transform into a single call
#.fit_transform is used on the training dataset

X_train_fit_trans = ct.fit_transform(X_train)
X_train_fit_trans

array([[ 1.        ,  0.        ,  0.        ,  0.        ,  1.        ,
        -0.0325683 ],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
        -0.48733085],
       [ 0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
        -0.34285405],
       ...,
       [ 0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
        -0.35045024],
       [ 1.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         1.7030926 ],
       [ 1.        ,  0.        ,  0.        ,  0.        ,  1.        ,
         0.8747751 ]])

In [30]:
X_test_trans = ct.transform(X_test)
X_test_trans

array([[ 0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
        -0.32839086],
       [ 0.        ,  1.        ,  0.        ,  0.        ,  1.        ,
        -0.42042549],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
        -0.4703621 ],
       ...,
       [ 0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
        -0.47092837],
       [ 0.        ,  1.        ,  0.        ,  1.        ,  0.        ,
        -0.37194334],
       [ 0.        ,  0.        ,  1.        ,  0.        ,  1.        ,
        -0.23207234]])

# Out of curiosity

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
logreg = LogisticRegression()

In [31]:
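#fit on the array returned by fit_transform above; ravel() turns the single-column target into the 1-D array scikit-learn expects
logreg.fit(X_train_fit_trans, y_train.values.ravel())

LogisticRegression()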

In [32]:
logreg.score(X_test_trans, y_test)

0.7757847533632287
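
For a classifier, .score returns the mean accuracy, i.e. the share of test rows that were predicted correctly. The sketch below shows that calculation in plain Python with made-up labels; `accuracy` is a hypothetical helper. Assuming the default 25% test split of the 891 rows (223 test rows), the score above is consistent with 173 correct predictions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(1 for truth, pred in zip(y_true, y_pred) if truth == pred)
    return correct / len(y_true)


# Made-up labels: three of the four predictions are correct
example_score = accuracy([0, 1, 1, 0], [0, 1, 0, 0])
print(example_score)

# 173 correct out of 223 test rows (inferred, see the assumption above)
inferred_score = 173 / 223
print(inferred_score)
```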