In [46]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

- **Author**: Alva rani James, PhD, year: 2023

## What is supervised learning

- Supervised learning refers to a branch of Machine Learning that involves analyzing the relationship between a set of independent variables and a dependent variable
- In this type of learning, the variables used to make predictions are called independent variables, while the variable being predicted is referred to as the dependent variable
- For instance, if we aim to predict a person's age based on their height and weight, height and weight would be the independent variables, while age would be the dependent variable

- This exercise gives an overview of supervised learning and applies several algorithms to compare their accuracy.





- Supervised learning covers both regression (predicting a continuous value, as in the age example above) and classification, where we predict which category a data point belongs to. This exercise is a classification problem.
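As a minimal sketch of this idea, assuming scikit-learn is installed: the toy numbers and the names `X_toy`, `y_toy`, `clf` and `new_points` are illustrative only and are not part of the Titanic data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Independent variable: one made-up feature value per row.
X_toy = np.array([[0], [1], [2], [3], [10], [11], [12], [13]])
# Dependent variable: the known category of each row.
y_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Learn the relationship between the feature and the category.
clf = LogisticRegression()
clf.fit(X_toy, y_toy)

# Predict the category of rows the model has not seen.
new_points = np.array([[1], [12]])
predicted = clf.predict(new_points)
print(predicted)
```

The same `fit` / `predict` pattern is what the rest of the exercise applies to the Titanic columns.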

## Import data

- Download the Titanic dataset (`train.csv`) from Kaggle
- It has 11 variables that we will use to predict whether a passenger survived the accident
- Let's explore the dataset before applying different algorithms for the prediction


In [5]:
titanic=pd.read_csv("train.csv")
titanic.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
# get the dimensions of the data frame
# using the shape attribute
titanic.shape
# 891 rows and 12 columns

(891, 12)

- 11 independent features and 1 dependent variable (the Survived column)

## Supervised algorithms that we are going to apply on the above dataset:

- Logistic Regression: https://machinelearningmastery.com/logistic-regression-for-machine-learning/
- K-NN Algorithm: https://machinelearningmastery.com/tutorial-to-implement-k-nearest-neighbors-in-python-from-scratch/
- Naive Bayes Classifier: https://machinelearningmastery.com/naive-bayes-tutorial-for-machine-learning/
- Linear Support Vector Machines: https://machinelearningmastery.com/support-vector-machines-for-machine-learning/
- Non-Linear Support Vector Machines: https://machinelearningmastery.com/support-vector-machines-for-machine-learning/
- Decision Trees: https://machinelearningmastery.com/classification-and-regression-trees-for-machine-learning/
- Random Forest: https://machinelearningmastery.com/bagging-and-random-forest-ensemble-algorithms-for-machine-learning/

This exercise covers the application of the above algorithms without any explanations. I have added a link for each algorithm for further reading.

The whole application can be broken down into four parts:
1. Data Pre-processing & Cleansing
2. Splitting Data into Training and Test Set
3. Applying all the above algorithms
4. Comparing the accuracy scores


### Data preprocessing and cleaning

We will remove columns from both data frames if they are deemed unimportant or contain more than 80% null values.

Once we have removed unnecessary columns, we will handle null values. If a column contains fewer than 80% null values, we will replace them with either the most frequent category (for categorical columns) or the mean of the column (for numerical columns). If a column has 60-80% null values and is considered important, an alternative is to create a new category that stands in for the null values.
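These rules can be sketched on a small made-up data frame. The column names, the values and the variable names (`df`, `null_share`, `cleaned`) are assumptions chosen for illustration; only the 80% threshold and the mode/mean replacement come from the text above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data (all values are made up).
df = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, np.nan, 1.0],
    "age": [22.0, np.nan, 26.0, 35.0, 35.0, 30.0],
    "port": ["S", "C", None, "S", "S", "Q"],
})

# Fraction of null values per column.
null_share = df.isna().mean()

# Drop columns with more than 80% nulls.
to_drop = null_share[null_share > 0.8].index
cleaned = df.drop(columns=to_drop)

# Fill the rest: mean for numerical columns, most frequent value otherwise.
for col in cleaned.columns:
    if pd.api.types.is_numeric_dtype(cleaned[col]):
        cleaned[col] = cleaned[col].fillna(cleaned[col].mean())
    else:
        cleaned[col] = cleaned[col].fillna(cleaned[col].mode()[0])

print(null_share)
print(cleaned)
```

Here `mostly_missing` is dropped because 5 of its 6 values are null, while `age` and `port` are filled.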

For the numerical columns, a box plot can be used to identify outliers, which can then be replaced with appropriate values. The steps below apply the column removal and null-value handling to the data set.
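A box plot flags points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule numerically to a handful of illustrative fare values and caps the flagged points at the whisker limits. Capping is only one possible choice of "appropriate value", and the names `fares`, `outliers` and `capped` are assumptions for this example.

```python
import pandas as pd

# A few illustrative fare values, including one very large one.
fares = pd.Series([7.25, 7.925, 8.05, 53.1, 71.2833, 512.3292])

# Quartiles and interquartile range.
q1, q3 = fares.quantile([0.25, 0.75])
iqr = q3 - q1

# Box-plot whisker limits.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the whiskers are treated as outliers.
outliers = fares[(fares < lower) | (fares > upper)]

# Replace outliers by capping them at the whisker limits.
capped = fares.clip(lower=lower, upper=upper)

print(outliers)
print(capped)
```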

##### 1.1 Split into categorical and numerical data frames

- The first step is to split the data into two separate data frames: Categorical and Numerical. The Categorical data frame will include all columns with categorical data, while the Numerical data frame will contain all columns with numerical data.

In [15]:
cat_var = titanic.select_dtypes(object)
num_var = titanic.select_dtypes(np.number)

In [11]:
print("Titanic categorical dataframe",cat_var.head())
print("Titanic numerical dataframe",num_var.head())

Titanic categorical dataframe                                                 Name     Sex  \
0                            Braund, Mr. Owen Harris    male   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   
2                             Heikkinen, Miss. Laina  female   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female   
4                           Allen, Mr. William Henry    male   

             Ticket Cabin Embarked  
0         A/5 21171   NaN        S  
1          PC 17599   C85        C  
2  STON/O2. 3101282   NaN        S  
3            113803  C123        S  
4            373450   NaN        S  
Titanic numerical dataframe    PassengerId  Survived  Pclass   Age  SibSp  Parch     Fare
0            1         0       3  22.0      1      0   7.2500
1            2         1       1  38.0      1      0  71.2833
2            3         1       3  26.0      0      0   7.9250
3            4         1       1  35.0      1      0  53.1000
4            5         0 

- From the categorical data frame, let's drop the unwanted columns: `Ticket` and `Name`

In [16]:
cat_var.drop(['Name','Ticket'], axis=1, inplace=True)
cat_var.head()

Unnamed: 0,Sex,Cabin,Embarked
0,male,,S
1,female,C85,C
2,female,,S
3,female,C123,S
4,male,,S


In [17]:
#check for null values
cat_var.isnull().sum()


Sex           0
Cabin       687
Embarked      2
dtype: int64

- The Cabin column has 687 null values, while the Embarked column has only 2.
- With 891 rows in total, about 77% of the Cabin column is null, which is close to the 80% threshold.
- To handle this, we can either delete the affected rows or replace the null values with the most frequent category.
- For the time being, let's opt for the latter and proceed with the replacement.

In [20]:
# Replace all the null values with the most frequent category
# (assign the result back; chained fillna(..., inplace=True) is unreliable in recent pandas)
cat_var['Cabin'] = cat_var['Cabin'].fillna(cat_var['Cabin'].value_counts().idxmax())
cat_var['Embarked'] = cat_var['Embarked'].fillna(cat_var['Embarked'].value_counts().idxmax())

In [21]:
# Now check again for the null values
cat_var.isnull().sum()

Sex         0
Cabin       0
Embarked    0
dtype: int64
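For comparison, here is a hedged sketch of the two options for a column with many nulls: filling with the most frequent category (the approach used above) versus creating a new category, which was mentioned earlier as an option for important columns. The toy `cabin` series and the label `"Unknown"` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy cabin column with missing entries (values are made up).
cabin = pd.Series(["C85", np.nan, "C85", "C123", np.nan], name="Cabin")

# Option A: fill nulls with the most frequent category.
mode_filled = cabin.fillna(cabin.value_counts().idxmax())

# Option B: keep "missing" visible as its own category.
flag_filled = cabin.fillna("Unknown")

print(mode_filled.tolist())
print(flag_filled.tolist())
```

Option B keeps the information that a value was missing, which may matter when most of a column is null.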

- The next step is to replace the categorical values with numerical labels so that we can apply our algorithms to them. To do this, we use scikit-learn's `LabelEncoder`.

In [26]:
le = LabelEncoder()
cat_var = cat_var.apply(le.fit_transform)
cat_var.head()

Unnamed: 0,Sex,Cabin,Embarked
0,1,47,2
1,0,81,0
2,0,47,2
3,0,55,2
4,1,47,2
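`LabelEncoder` assigns integers according to the sorted order of the values it sees. When a single encoder is refitted on each column, as above, it only remembers the mapping of the last column. The sketch below keeps one encoder per column so that every mapping can be inspected or reversed later; the toy frame and the `encoders` dictionary are assumptions for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy categorical frame (values are made up).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

encoders = {}          # one fitted encoder per column
encoded = toy.copy()
for col in toy.columns:
    enc = LabelEncoder()
    encoded[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# Classes are stored in sorted order; the position is the label.
print(list(encoders["Sex"].classes_))
print(encoded)

# The stored encoder can translate the labels back.
restored = encoders["Sex"].inverse_transform(encoded["Sex"]).tolist()
print(restored)
```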


##### Next, we work on the numerical data frame

In [28]:
num_var.isna().sum()


PassengerId      0
Survived         0
Pclass           0
Age            177
SibSp            0
Parch            0
Fare             0
dtype: int64

- The Age column has 177 null values
- We replace them with the column mean

In [30]:
num_var['Age'] = num_var['Age'].fillna(num_var['Age'].mean())
num_var.isna().sum()

PassengerId    0
Survived       0
Pclass         0
Age            0
SibSp          0
Parch          0
Fare           0
dtype: int64

- PassengerId is only a row identifier and carries no predictive information, so we drop it.

In [31]:
num_var.drop(['PassengerId'], axis=1, inplace=True)
num_var.head()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
0,0,3,22.0,1,0,7.25
1,1,1,38.0,1,0,71.2833
2,1,3,26.0,0,0,7.925
3,1,1,35.0,1,0,53.1
4,0,3,35.0,0,0,8.05


- Let's combine the preprocessed categorical and numerical data frames

In [32]:
titanic_final = pd.concat([cat_var,num_var],axis=1)
titanic_final.head()

Unnamed: 0,Sex,Cabin,Embarked,Survived,Pclass,Age,SibSp,Parch,Fare
0,1,47,2,0,3,22.0,1,0,7.25
1,0,81,0,1,1,38.0,1,0,71.2833
2,0,47,2,1,3,26.0,0,0,7.925
3,0,55,2,1,1,35.0,1,0,53.1
4,1,47,2,0,3,35.0,0,0,8.05
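`pd.concat(..., axis=1)` lines rows up by index label, not by position. This works here because both data frames still carry the original row index. The small sketch below, with made-up frames and illustrative names, shows the aligned case and what happens when the indexes differ.

```python
import pandas as pd

# Two toy frames sharing the same row index.
left = pd.DataFrame({"Sex": [1, 0, 0]}, index=[0, 1, 2])
right = pd.DataFrame({"Age": [22.0, 38.0, 26.0]}, index=[0, 1, 2])

# Same index: rows are matched one to one.
aligned = pd.concat([left, right], axis=1)

# Different index: unmatched rows are padded with NaN.
shifted = right.set_axis([1, 2, 3], axis=0)
misaligned = pd.concat([left, shifted], axis=1)

print(aligned)
print(misaligned)
```

If rows had been deleted from only one of the frames, a quick `isna().sum()` on the combined frame would reveal the mismatch.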


## Splitting Data into Training and Test Set

We need to specify our dependent and independent variables. In this case, our dependent variable will be "Survived" because we aim to predict whether a person will survive or not. The remaining variables will be considered independent. The code for the partition is provided below.

In [37]:
X = titanic_final.drop(['Survived'], axis=1)
Y = titanic_final['Survived']

In [40]:
#Now we will be taking 80% of the data as our training set, and remaining 20% as our test set.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
len(X_train), len(Y_train), len(X_test), len(Y_test)


(712, 712, 179, 179)
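The sizes above follow from the row count: with `test_size=0.2`, scikit-learn rounds the test share of 891 rows up to 179 and leaves 712 rows for training. A small sketch with dummy arrays (the names `X_dummy` and `y_dummy` are placeholders, not the Titanic data) reproduces the arithmetic:

```python
import math

import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891
# Dummy features and labels with the same number of rows as the data set.
X_dummy = np.arange(n_rows).reshape(-1, 1)
y_dummy = np.zeros(n_rows, dtype=int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_dummy, y_dummy, test_size=0.2, random_state=0
)

# 0.2 * 891 = 178.2, rounded up to 179 test rows.
expected_test = math.ceil(0.2 * n_rows)
print(len(X_tr), len(X_te), expected_test)
```

Fixing `random_state` makes the split reproducible between runs.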

- Now let’s start applying our Supervised Learning Algorithms



## Applying all the above algorithms

All of the machine learning models used here are available in the scikit-learn (sklearn) package. We will use it to apply all of the models mentioned above to the processed Titanic dataset and compare their accuracy scores. The classes were imported at the top of the notebook, so we start by creating one instance of each algorithm.

In [43]:
#initialize the algorithms
LR = LogisticRegression()
KNN = KNeighborsClassifier()
NB = GaussianNB()
LSVM = LinearSVC()
NLSVM = SVC(kernel='rbf')
DT = DecisionTreeClassifier()
RF = RandomForestClassifier()

In [44]:
#next step is to train our model on our Training Data Set:
LR_fit = LR.fit(X_train, Y_train)
KNN_fit = KNN.fit(X_train, Y_train)
NB_fit = NB.fit(X_train, Y_train)
LSVM_fit = LSVM.fit(X_train, Y_train)
NLSVM_fit = NLSVM.fit(X_train, Y_train)
DT_fit = DT.fit(X_train, Y_train)
RF_fit = RF.fit(X_train, Y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
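The warning above suggests two remedies: scaling the data or raising `max_iter`. The sketch below shows both on synthetic data with features on very different scales. The data, the variable names and the value `max_iter=5000` are assumptions for illustration, not settings used in this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic features on very different scales.
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = (X_demo[:, 0] + X_demo[:, 1] / 100.0 > 0).astype(int)

# Option 1: standardise the features before fitting.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression())
scaled_lr.fit(X_demo, y_demo)

# Option 2: give the solver more iterations.
patient_lr = LogisticRegression(max_iter=5000)
patient_lr.fit(X_demo, y_demo)

print(scaled_lr.score(X_demo, y_demo))
print(patient_lr.score(X_demo, y_demo))
```

Scaling inside a pipeline has the advantage that the scaler is fitted on the training data only and then reused unchanged on the test data.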


## Comparing the accuracy scores

In [45]:
#Now, we need to predict on the Test Data Set and compare the accuracy scores
LR_pred = LR_fit.predict(X_test)
KNN_pred = KNN_fit.predict(X_test)
NB_pred = NB_fit.predict(X_test)
LSVM_pred = LSVM_fit.predict(X_test)
NLSVM_pred = NLSVM_fit.predict(X_test)
DT_pred = DT_fit.predict(X_test)
RF_pred = RF_fit.predict(X_test)

In [47]:
# accuracy_score expects (y_true, y_pred); the score is the same either way
print("Logistic Regression is %f percent accurate" % (accuracy_score(Y_test, LR_pred)*100))
print("KNN is %f percent accurate" % (accuracy_score(Y_test, KNN_pred)*100))
print("Naive Bayes is %f percent accurate" % (accuracy_score(Y_test, NB_pred)*100))
print("Linear SVMs is %f percent accurate" % (accuracy_score(Y_test, LSVM_pred)*100))
print("Non Linear SVMs is %f percent accurate" % (accuracy_score(Y_test, NLSVM_pred)*100))
print("Decision Trees is %f percent accurate" % (accuracy_score(Y_test, DT_pred)*100))
print("Random Forests is %f percent accurate" % (accuracy_score(Y_test, RF_pred)*100))

Logistic Regression is 79.329609 percent accurate
KNN is 72.625698 percent accurate
Naive Bayes is 80.446927 percent accurate
Linear SVMs is 75.418994 percent accurate
Non Linear SVMs is 70.949721 percent accurate
Decision Trees is 74.860335 percent accurate
Random Forests is 81.564246 percent accurate
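Accuracy is simply the share of test rows whose predicted label equals the true label. The sketch below computes it by hand on made-up labels and then checks the arithmetic behind the Random Forest figure, which is consistent with 146 correct predictions out of the 179 test rows.

```python
import numpy as np

# Made-up true and predicted labels for five rows.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Share of matching labels: 4 of 5 are correct.
accuracy = (y_true == y_pred).mean()
print(accuracy)

# 146 correct out of 179 test rows, expressed as a percentage.
rf_accuracy = 146 / 179 * 100
print("%f" % rf_accuracy)
```

With only 179 test rows, each additional correct prediction moves the score by about 0.56 percentage points, so small gaps between models should be read with care.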


## Conclusion

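- Random Forest gave the most accurate predictions on the test set (81.6%), followed by Naive Bayes (80.4%) and Logistic Regression (79.3%).
- Note that we have not yet tuned or otherwise improved any of the models, and that the comparison rests on a single train/test split.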