## **This is my first Kaggle Notebook.**
**This notebook explores the basic use of pandas for data analysis and cleaning, and of scikit-learn for applying ML algorithms to this classification problem.**

**What will you find in this notebook:**
1. Exploratory Data Analysis on Titanic Dataset.
    * Understanding the Data
2. Data Preprocessing.
    * Handling missing values
    * Converting categorical features to numerical
    * Splitting the data into training and test sets for the ML algorithms
3. Scikit-learn basic ML algorithms.
    * Implementing different classifiers from the sklearn library: Logistic Regression, Gaussian Naive Bayes, KNN, Decision Tree, Random Forest, and SVM
4. Comparison of Model performances.
    * Using performance metrics like confusion_matrix and accuracy_score.

References:

* Udemy: Machine Learning A to Z: Hands-on Python & R, Team Data Science
* Krish Naik: Exploratory Data Analysis

#### If you find this useful in helping you learn some new things, please upvote and let's begin...

#### Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()

import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

#### Loading the Dataset.

In [None]:
df=pd.read_csv('../input/titanic/train.csv',na_values={'Cabin':0},index_col='PassengerId')
df.head()

#### Let us begin with Exploratory Data Analysis

In [None]:
df.info()

* **There are 891 entries**
* **Columns like Age and Cabin have missing values**
* **Columns like Name, Sex, Cabin, etc. hold categorical data**
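
The same observations can also be read off programmatically rather than by eye. The sketch below is only an illustration: it uses a tiny made-up frame (`toy`, not the real Titanic data) to show how missing values can be counted per column and how non-numeric columns can be listed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame (values are illustrative only)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Sex": ["male", "female", "female", "male"],
})

# Number of missing entries in each column
missing_counts = toy.isnull().sum()
print(missing_counts)

# Columns that are not numeric, i.e. candidates for categorical handling
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

Running the same two calls on `df` should point to the columns that need imputation or encoding later on.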

In [None]:
df.describe()

**There are two labels (survived or died). Let us check how many passengers fall under each label in our data.**

In [None]:
sns.countplot(x='Survived', data=df)

**Is it true that females had a higher survival rate in this incident? Let us see. If so, can you come up with a reason for it?**

In [None]:
df.groupby(['Survived','Sex'])['Survived'].count()

In [None]:
sns.catplot(x='Sex', col='Survived', kind='count', data=df)

**The reason might be that, during the disaster, priority was given to saving women and children first.**
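
Raw counts can hide the picture when the groups differ in size, so a rate per group is often easier to compare. Because `Survived` is coded as 0/1, the mean of the column within each group is the survival rate. A minimal sketch on a made-up frame (`toy` is hypothetical, not the Titanic data):

```python
import pandas as pd

# Illustrative data only: 3 females (2 survived), 4 males (1 survived)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column per group = fraction of survivors in that group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

The same call on `df` would give the survival rate for each sex directly.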

**Can you guess how the survival rate varies with passenger class? Let's check.**

In [None]:
pd.crosstab(df['Pclass'], df['Survived'], margins=True).style.background_gradient(cmap='autumn_r')

In [None]:
# Survival rate per passenger class, expressed as a percentage
print("% of survivors in")
for pclass in (1, 2, 3):
    # Survived is 0/1, so its mean within a class is the survival rate
    rate = df.loc[df['Pclass'] == pclass, 'Survived'].mean() * 100
    print(f"Pclass={pclass} : {rate:.1f}")

**You guessed right. The survival rate is higher for passengers with higher-class tickets (class 1 > class 2 > class 3). A possible reason, and this is only a guess, is that passengers in the higher classes were wealthier and were able to escape first, perhaps by offering money to rescuers.**
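
The per-class rates can also be obtained in one call with `pd.crosstab` and its `normalize="index"` option, which divides each row by its row total. The frame below is made up purely for illustration:

```python
import pandas as pd

# Illustrative data only, not the Titanic file
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Each row sums to 1: share of non-survivors (0) and survivors (1) per class
rate_table = pd.crosstab(toy["Pclass"], toy["Survived"], normalize="index")
print((rate_table * 100).round(1))
```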

In [None]:
pd.crosstab([df['Sex'], df['Survived']], df['Pclass'], margins=True).style.background_gradient(cmap='autumn_r')

#### Data Cleaning

In [None]:
df = pd.read_csv('../input/titanic/train.csv',na_values={'Cabin':0})
df_test = pd.read_csv('../input/titanic/test.csv',na_values={'Cabin':0})
df_apply = df.copy()

In [None]:
df_apply_x = pd.get_dummies(df_apply, columns=['Sex', 'Embarked', 'Pclass'], drop_first=True)
df_apply_x = df_apply_x.drop(['PassengerId','Name','Ticket','Cabin'],axis=1)
df_apply_y = df_apply['Survived']

df_test = pd.get_dummies(df_test, columns=['Sex', 'Embarked', 'Pclass'], drop_first=True)
submission = pd.DataFrame()
submission['PassengerId'] = df_test['PassengerId']
df_test = df_test.drop(['PassengerId','Name','Ticket','Cabin'],axis=1)
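
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up data. It also shows one way to guard against a test set that is missing a category: reindexing the test columns to match the training columns. The frames `toy_train` and `toy_test` are hypothetical.

```python
import pandas as pd

toy_train = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})
# This toy test set happens to contain no passenger who embarked at "Q"
toy_test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Embarked": ["S", "C"],
})

# drop_first=True removes the first category of each column
# ("female" and "C" here), which avoids a redundant dummy column
enc_train = pd.get_dummies(toy_train, columns=["Sex", "Embarked"], drop_first=True)
enc_test = pd.get_dummies(toy_test, columns=["Sex", "Embarked"], drop_first=True)

# Align the test columns with the training columns; absent dummies become 0
enc_test = enc_test.reindex(columns=enc_train.columns, fill_value=0)

print(enc_train.columns.tolist())
print(enc_test)
```

In this notebook the training and test files contain the same categories, so the columns already line up; the `reindex` step is a safety net for cases where they do not.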

In [None]:
df_apply_x.head()

In [None]:
df_apply_y.head()

**Replacing the NaN values in the Age column with the median of that column. (The Cabin column was already dropped above, so it needs no imputation.)**

In [None]:
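median=df_apply_x['Age'].median()
# Assign the result back instead of calling fillna(inplace=True) on a column
# selection, which newer pandas versions may not propagate to the DataFrame
df_apply_x['Age'] = df_apply_x['Age'].fillna(median)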
#df_apply_x['Cabin'][df_apply_x['Cabin']!=1]=0
df_apply_x.head()

In [None]:
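# Fill the test set with the training median computed above, so both sets use
# the same value and no statistic is taken from the test data
df_test['Age'] = df_test['Age'].fillna(median)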
#df_apply_x['Cabin'][df_apply_x['Cabin']!=1]=0
df_test.head()
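
The idea behind the two cells above, computing the fill value on the training data and reusing it for the test data, can be sketched on two short made-up series (`train_age` and `test_age` are hypothetical):

```python
import numpy as np
import pandas as pd

train_age = pd.Series([22.0, np.nan, 30.0, 40.0])
test_age = pd.Series([np.nan, 18.0])

# The median ignores NaN by default: median of 22, 30, 40
train_median = train_age.median()

# Use the same training statistic for both sets
train_filled = train_age.fillna(train_median)
test_filled = test_age.fillna(train_median)

print(train_median)
print(test_filled.tolist())
```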

In [None]:
df_apply_x.dropna(inplace=True)
df_apply_x.head()

In [None]:
df_apply_x.info()

In [None]:
df_test.info()

**Correlation Matrix**

In [None]:
corr = df_apply_x.corr()

f,ax = plt.subplots(figsize=(9,6))
sns.heatmap(corr, annot = True, linewidths=1.5 , fmt = '.2f',ax=ax)
plt.show()

**Survived and Fare are positively correlated, while Survived and Sex_male are negatively correlated.
Survived and Pclass_3 are also negatively correlated.
If you need to brush up on what correlation and covariance are, just go to the StatQuest YouTube channel; that guy is just awesome. BAM!!!**
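
For a quick refresher, the number in each heatmap cell is a Pearson correlation: the covariance of the two columns divided by the product of their standard deviations. The sketch below computes it by hand on two made-up arrays and checks the result against pandas; `x` and `y` are illustrative values, not Titanic columns.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance: average product of the deviations from each mean
cov_xy = np.mean((x - x.mean()) * (y - y.mean()))

# Pearson r: covariance scaled by both standard deviations, so it lies in [-1, 1]
r_manual = cov_xy / (x.std() * y.std())

# pandas uses the same definition for Series.corr and DataFrame.corr
r_pandas = pd.Series(x).corr(pd.Series(y))

print(round(r_manual, 4), round(r_pandas, 4))
```

A positive value means the two quantities tend to rise together, and a negative value means one tends to fall as the other rises.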

**Converting the data frames into arrays, since the sklearn ML models take their input in the form of arrays**

In [None]:
df_apply_x.fillna(df_apply_x.mean(), inplace=True)
df_apply_x = df_apply_x.drop(['Survived'],axis=1)
X=df_apply_x.values
Y=df_apply_y.values

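The label column (Survived) has been removed from X.

Splitting the data into training and test sets: the hold-out split is commented out below, and the models are trained on the full training set, because the predictions are made on Kaggle's separate test file.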

In [None]:
#from sklearn.model_selection import train_test_split
#X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.3, random_state = 1)
X_train = X
Y_train = Y

In [None]:
#print(X_train)
print(len(X_train))
#print(X_test)
#print(len(X_test))
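
If a labelled hold-out set is wanted for checking the models locally, the commented-out split could look roughly like the sketch below. It runs on synthetic data (`X_demo`, `y_demo` are made up), reuses `test_size=0.3` and `random_state=1` from the commented line, and adds `stratify` as an assumption so that both parts keep the same class balance.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows, 9 features, 60/40 class balance
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 9))
y_demo = np.array([0] * 60 + [1] * 40)

# 70 % for fitting, 30 % held out; stratify keeps the label ratio in both parts
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1, stratify=y_demo
)

print(len(X_tr), len(X_val))
```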

In [None]:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)

In [None]:
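# Fill any remaining gaps in the test features (e.g. a missing Fare) with the column means
df_test.fillna(df_test.mean(), inplace=True)
# Reuse the scaler fitted on the training data; do not refit it on the test set
scaled_features = sc.transform(df_test)
X_test = pd.DataFrame(scaled_features, columns=df_test.columns)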
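
The fit-on-train, transform-on-test pattern used above can be seen on a tiny made-up example. `StandardScaler` learns the mean and standard deviation of each column in `fit_transform` and applies those same numbers in `transform`; the arrays below are illustrative only.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_demo = np.array([[20.0, 7.0], [30.0, 14.0], [40.0, 21.0]])
test_demo = np.array([[30.0, 70.0]])

scaler = StandardScaler()
# Learn column means and standard deviations from the training rows only
train_scaled = scaler.fit_transform(train_demo)
# Apply the training statistics to the test rows (no refitting)
test_scaled = scaler.transform(test_demo)

print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

The scaled training columns have mean 0 and standard deviation 1, while the test values are expressed relative to the training distribution, which is what the models expect.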

**Scikit-learn basic ML algorithms**
* KNN
* Logistic Regression
* Naive Bayes
* SVM
* Decision Tree
* Random Forest

**Training KNN model**

In [None]:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
classifier_KNN = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
classifier_KNN.fit(X_train, Y_train)

In [None]:
Y_pred_KNN = classifier_KNN.predict(X_test)
#print(confusion_matrix(Y_test, Y_pred_KNN))
#print(classification_report(Y_test, Y_pred_KNN))
#print("Accuracy: ",accuracy_score(Y_test, Y_pred_KNN)*100, "%")

**Training Logistic Regression Model**

In [None]:
from sklearn.linear_model import LogisticRegression
classifier_LR = LogisticRegression()
classifier_LR.fit(X_train,Y_train)
Y_pred_LR = classifier_LR.predict(X_test)
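# Kaggle's test.csv has no labels (Y_test is not defined), so the metrics are
# computed from 5-fold cross-validated predictions on the training data
from sklearn.model_selection import cross_val_predict
Y_cv_LR = cross_val_predict(classifier_LR, X_train, Y_train, cv=5)
print(confusion_matrix(Y_train, Y_cv_LR))
print(classification_report(Y_train, Y_cv_LR))
print("Accuracy: ",accuracy_score(Y_train, Y_cv_LR)*100 ,"%")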

**Gaussian Naive Bayes**

In [None]:
from sklearn.naive_bayes import GaussianNB
classifier_GNB = GaussianNB()
classifier_GNB.fit(X_train,Y_train)
Y_pred_GNB = classifier_GNB.predict(X_test)
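# Kaggle's test.csv has no labels (Y_test is not defined), so the metrics are
# computed from 5-fold cross-validated predictions on the training data
from sklearn.model_selection import cross_val_predict
Y_cv_GNB = cross_val_predict(classifier_GNB, X_train, Y_train, cv=5)
print(confusion_matrix(Y_train, Y_cv_GNB))
print(classification_report(Y_train, Y_cv_GNB))
print("Accuracy: ",accuracy_score(Y_train, Y_cv_GNB)*100," %")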

**Decision Tree**

In [None]:
from sklearn.tree import DecisionTreeClassifier
classifier_DT= DecisionTreeClassifier()
classifier_DT.fit(X_train,Y_train)
Y_pred_DT = classifier_DT.predict(X_test)
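# Kaggle's test.csv has no labels (Y_test is not defined), so the metrics are
# computed from 5-fold cross-validated predictions on the training data
from sklearn.model_selection import cross_val_predict
Y_cv_DT = cross_val_predict(classifier_DT, X_train, Y_train, cv=5)
print(classification_report(Y_train, Y_cv_DT))
print(accuracy_score(Y_train, Y_cv_DT))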

**Random forest**

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier_RF = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
classifier_RF.fit(X_train, Y_train)
Y_pred_RF = classifier_RF.predict(X_test)
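# Kaggle's test.csv has no labels (Y_test is not defined), so the metrics are
# computed from 5-fold cross-validated predictions on the training data
from sklearn.model_selection import cross_val_predict
Y_cv_RF = cross_val_predict(classifier_RF, X_train, Y_train, cv=5)
print(classification_report(Y_train, Y_cv_RF))
print("Accuracy: ",accuracy_score(Y_train, Y_cv_RF)*100, " %")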

**SVM**

In [None]:
from sklearn.svm import SVC
classifier_SVC = SVC(gamma = 0.01, C = 100)#, probability=True)
classifier_SVC.fit(X_train, Y_train)
Y_pred_SVC = classifier_SVC.predict(X_test)
#print(classification_report(Y_test,Y_pred_SVC))
#print("Accuracy: ",accuracy_score(Y_test, Y_pred_SVC)*100, " %")
submission['Survived'] = Y_pred_SVC
submission.to_csv('submission.csv',index=False)
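
**Comparing the models**

Since the Kaggle test file carries no labels, one way to compare the six classifiers side by side is k-fold cross-validation on labelled data. The sketch below is an assumption about how such a comparison could be set up, not a result from this notebook: it runs on synthetic data from `make_classification` (a stand-in with 9 features), wraps each model in a pipeline so the scaler is refitted inside every fold, and reuses the hyperparameters from the cells above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the prepared features (9 columns, binary label)
X_demo, y_demo = make_classification(n_samples=200, n_features=9, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=2),
    "Logistic Regression": LogisticRegression(),
    "Gaussian NB": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(
        n_estimators=10, criterion="entropy", random_state=0
    ),
    "SVM": SVC(gamma=0.01, C=100),
}

cv_accuracy = {}
for name, model in models.items():
    # Scaling lives inside the pipeline, so each fold is scaled on its own training part
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()

# Print the models from best to worst mean accuracy
for name, acc in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} {acc * 100:.1f} %")
```

To apply this to the Titanic data, `X_demo` and `y_demo` would be replaced by the unscaled feature array `X` and the labels `Y`.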

**Thank You!!!**