# DATA PRE-PROCESSING
    Eliminating noise from dataset
    Dataset : https://www.kaggle.com/c/titanic/data
    'Survived' is to be predicted. 'train.csv' is used below.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

Here we load the dataset, which is stored in CSV format, using the pandas library.

In [2]:
data_set = pd.read_csv('train.csv')
data_set.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


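pd.read_csv already returns a dataframe; we pass it to pd.DataFrame to get the dataframe df used in the following cells.
The statement X = df.iloc[:, df.columns != 'Survived'] creates a dataframe holding the data for all features except 'Survived', and Y holds only the 'Survived' column.

describe() can be called on a dataframe to get summary statistics for its numeric columns.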

In [3]:
df = pd.DataFrame(data_set)
#df.columns

In [4]:
X = df.iloc[:,df.columns != 'Survived']
Y = df.iloc[:,df.columns == 'Survived']
X_column = X.columns
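
As a cross-check, the same split can also be written with label-based selection. This is a minimal sketch on a tiny hand-made frame: the `toy` data and the names `X_mask`, `X_drop`, `Y_mask` and `Y_keep` are made up for illustration and are not taken from train.csv.

```python
import pandas as pd

# Tiny illustrative frame; the values are made up, not read from train.csv
toy = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Survived': [0, 1, 1],
    'Age': [22.0, None, 26.0],
})

# Boolean-mask split, as in the cell above
X_mask = toy.iloc[:, toy.columns != 'Survived']
Y_mask = toy.iloc[:, toy.columns == 'Survived']

# Label-based split that selects the same columns
X_drop = toy.drop(columns='Survived')
Y_keep = toy[['Survived']]

# Both approaches should yield identical frames
print(X_mask.equals(X_drop), Y_mask.equals(Y_keep))
```

Either form works; the label-based one tends to be easier to read when only one or two columns are involved.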

In [5]:
X.describe()

,PassengerId,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,2.0,20.125,0.0,0.0,7.9104
50%,446.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,3.0,38.0,1.0,0.0,31.0
max,891.0,3.0,80.0,8.0,6.0,512.3292


Here we count the NaN values for each feature in X (dataframe).

In [6]:
X.isnull().sum() 

PassengerId      0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
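
Raw counts are easier to judge next to the share of rows they represent; in the output above, Cabin is missing in 687 of 891 rows (about 77%) and Age in 177 (about 20%). The sketch below shows one way to get both numbers side by side. It assumes a small made-up frame `toy` in place of X.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps, standing in for X
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 53.1],
})

missing_count = toy.isnull().sum()    # number of NaN values per column
missing_share = toy.isnull().mean()   # fraction of rows that are NaN per column

# Combine both views and list the most incomplete columns first
summary = pd.DataFrame({'count': missing_count, 'share': missing_share})
print(summary.sort_values('share', ascending=False))
```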

## Replacing 'NaN' in the dataset
    To convert a dataframe to a NumPy array, use:
        numpy_array_name = dataframe_variable_name.values

    e.g.
        X = X.values
    
    Next, we replace the missing values of the feature 'Age' with the column mean.
    imputer.fit() takes a 2D array as input. X[:,4] gives a 1D vector, while X[:,4:5] 
    gives a 2D array with a single column, so we pass the latter.

    The imputer fills missing values with a statistic (e.g. mean, median, ...) 
    of the data. To avoid data leakage during cross-validation, 
    it computes the statistic on the training data during fit, stores it, and 
    reuses it on the test data during transform.
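
The two points above (1D versus 2D slicing, and fitting on training data only) can be sketched as follows. The example assumes `SimpleImputer` from `sklearn.impute`, which newer scikit-learn releases provide in place of `sklearn.preprocessing.Imputer`; the `train` and `test` arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages; np.nan marks a missing value
train = np.array([[20.0], [30.0], [np.nan], [40.0]])
test = np.array([[np.nan], [50.0]])

print(train[:, 0].shape)    # integer index: 1D vector
print(train[:, 0:1].shape)  # slice: 2D array with one column

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(train)                      # learns the mean of the training column only
train_filled = imputer.transform(train)
test_filled = imputer.transform(test)   # reuses the stored training mean

print(imputer.statistics_)   # the stored per-column statistic
print(test_filled.ravel())
```

The missing test value is filled with the training mean, so nothing about the test set influences the statistic.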

In [7]:
X = X.values
Y = Y.values
# filling the missing values in age using mean strategy
# SimpleImputer lives in sklearn.impute (the older sklearn.preprocessing.Imputer was removed in scikit-learn 0.22)
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:,4:5])
X[:,4:5] = imputer.transform(X[:,4:5])

### Visualising the modified dataset

In [8]:
temp=pd.DataFrame(X)
temp.columns = X_column
temp.head()

,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
1,2,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38,1,0,PC 17599,71.2833,C85,C
2,3,3,"Heikkinen, Miss. Laina",female,26,0,0,STON/O2. 3101282,7.925,,S
3,4,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35,1,0,113803,53.1,C123,S
4,5,3,"Allen, Mr. William Henry",male,35,0,0,373450,8.05,,S


## Encoding the categorical features
        Categorical features (strings) must be encoded as numbers. The encoding should also not introduce
        an ordering among the categories, because such an ordering carries no meaning.
        
        e.g.
            If 'France' is encoded as 1 and 'Germany' as 2, then 2 > 1 holds for the codes, but no such
            relation holds between France and Germany.
            
        The cells below demonstrate this using Data.csv.
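
To make the France/Germany point concrete, here is a small sketch contrasting integer codes with indicator columns. It uses `pd.get_dummies` as a quick stand-in for one-hot encoding; the `countries` series and the 1/2 codes are made up to match the example above.

```python
import pandas as pd

# Made-up country column for illustration
countries = pd.Series(['France', 'Germany', 'France', 'Germany'], name='Country')

# Integer codes impose an order (2 > 1) that has no meaning for countries
codes = countries.map({'France': 1, 'Germany': 2})

# One indicator column per country carries no order at all
dummies = pd.get_dummies(countries, dtype=int)

print(codes.tolist())
print(dummies)
```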

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

#importing datasets
dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1]
X_column = X.columns
#X = X.values
Y = dataset.iloc[:,3].values

#### We can see that 'Country' is categorical data, so we need to encode it.
#### We also have to keep the model from inferring an order among the countries.

In [10]:
X.head()

,Country,Age,Salary
0,France,44.0,72000.0
1,Spain,27.0,48000.0
2,Germany,30.0,54000.0
3,Spain,38.0,61000.0
4,Germany,40.0,


In [11]:
X = X.values

#handling missing data
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])

from sklearn.preprocessing import LabelEncoder
#creating object of the class LabelEncoder

labelencoder_X = LabelEncoder()
X[:,0]=labelencoder_X.fit_transform(X[:,0])


In the above code, the label encoder converts the countries into integer values and returns a single array.
LabelEncoder assigns codes in sorted label order, so France is encoded as 0, Germany as 1 and Spain as 2.
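
To check which integer each label receives, the fitted encoder's `classes_` attribute can be inspected. The sketch below assumes a short made-up list with the same three country names; `mapping` is a helper name introduced here for illustration.

```python
from sklearn.preprocessing import LabelEncoder

# Same three country names as in Data.csv, in a made-up order
countries = ['France', 'Spain', 'Germany', 'Spain', 'Germany']

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ is sorted, so the position of a label is its integer code
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```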

Now we need to keep the model from treating these codes as ordered numbers.

In [12]:
X

array([[0, 44.0, 72000.0],
       [2, 27.0, 48000.0],
       [1, 30.0, 54000.0],
       [2, 38.0, 61000.0],
       [1, 40.0, 63777.77777777778],
       [0, 35.0, 58000.0],
       [2, 38.77777777777778, 52000.0],
       [0, 48.0, 79000.0],
       [1, 50.0, 83000.0],
       [0, 37.0, 67000.0]], dtype=object)

In [13]:
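from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# one-hot encode column 0 (Country) and pass Age and Salary through unchanged
# (the categorical_features argument of OneHotEncoder was removed in scikit-learn 0.22)
ct = ColumnTransformer([('country', OneHotEncoder(), [0])],
                       remainder='passthrough', sparse_threshold=0)
X = ct.fit_transform(X).astype(float)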

In [14]:
X

array([[1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.40000000e+01,
        7.20000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 2.70000000e+01,
        4.80000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 3.00000000e+01,
        5.40000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.80000000e+01,
        6.10000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 4.00000000e+01,
        6.37777778e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.50000000e+01,
        5.80000000e+04],
       [0.00000000e+00, 0.00000000e+00, 1.00000000e+00, 3.87777778e+01,
        5.20000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 4.80000000e+01,
        7.90000000e+04],
       [0.00000000e+00, 1.00000000e+00, 0.00000000e+00, 5.00000000e+01,
        8.30000000e+04],
       [1.00000000e+00, 0.00000000e+00, 0.00000000e+00, 3.70000000e+01,
        6.70000000e+04]])

From the above output, it can be seen that the single 'Country' column has been replaced by three columns in the X matrix. Each of the new columns
represents one category of the categorical feature.

e.g.
    The first column represents France, so it has the value 1 for the first row, and so on.
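
Which output column belongs to which country can be read from the fitted encoder's `categories_` attribute. This is a minimal sketch on a made-up country column; it passes the strings to `OneHotEncoder` directly, which recent scikit-learn versions accept without a prior label-encoding step.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Made-up country column as a 2D array, one row per sample
countries = np.array([['France'], ['Spain'], ['Germany'], ['Spain']])

encoder = OneHotEncoder()
onehot = encoder.fit_transform(countries).toarray()  # sparse result converted to dense

# categories_[0] lists the labels in the same order as the output columns
print(encoder.categories_[0])
print(onehot)
```

Each row contains exactly one 1, placed in the column of its country.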