# **1 Introduction**

This is my first attempt at the Titanic competition after learning about machine learning.
Some of the ideas in this notebook come from the notebooks below:

* https://www.kaggle.com/gunesevitan/titanic-advanced-feature-engineering-tutorial

* https://www.kaggle.com/code/khashayarrahimi94/knn-xgboost-svc-ensemble-with-just-5-feature

**1-1 Import the Libraries and Dataset**

In [None]:
import numpy as np 
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import KNNImputer
import warnings
warnings.filterwarnings('ignore')

In [None]:
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [None]:
Train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
Test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
All_data = pd.concat([Train_data, Test_data], sort=True).reset_index(drop=True)

**1-2 Data Overview** 

In [None]:
print('Training Shape = {}'.format(Train_data.shape))
print('Test Shape = {}'.format(Test_data.shape))
print('Name of columns in Training dataframe = {}'.format(Train_data.columns))
print('Name of columns in Test dataframe = {}'.format(Test_data.columns))

In [None]:
Train_data.head()

In [None]:
Train_data.info()

In [None]:
Test_data.head()

In [None]:
Test_data.info()

# **2 Exploratory Data Analysis**

**2-1 Missing Values**

To measure the correlation between features reliably and to improve the accuracy of the model, the NaN values of each feature need to be filled with sensible values first.

In [None]:
print('missing values of Train ')
print('\n')
for column in Train_data.columns.tolist():          
    print('{} column: {}'.format(column, Train_data[column].isnull().sum()))

In [None]:
print('missing values of Test ')
print('\n')
for column in Test_data.columns.tolist():          
    print('{} column: {}'.format(column, Test_data[column].isnull().sum()))
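The two loops above can also be condensed into a single table. A minimal sketch, assuming a small made-up frame (`toy`, not the competition data) in place of `Train_data`; `missing_summary` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    # Keep only the columns that actually contain missing values
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})
summary = missing_summary(toy)
print(summary)
```

Calling the same helper on `Train_data` and `Test_data` would give one table per dataframe.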

**2-1-1 Age**

Of all the features, Sex and Pclass correlate most closely with Age, so each missing age is filled with the median age of the passenger's Sex and Pclass group.

In [None]:
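# transform returns one value per row, aligned with All_data's index;
# groupby.apply can return a group-keyed index in recent pandas versions, which breaks the assignment
All_data['Age'] = All_data['Age'].fillna(All_data.groupby(['Sex', 'Pclass'])['Age'].transform('median'))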
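
To see what the group-wise fill does, here is a minimal sketch on a made-up frame (`toy` is illustrative, not the competition data). It uses `transform('median')`, which returns one value per original row, so the result lines up with the frame's index:

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Pclass': [3, 3, 3, 1, 1, 1],
    'Age': [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Median Age of each (Sex, Pclass) group, broadcast back to every row
group_median = toy.groupby(['Sex', 'Pclass'])['Age'].transform('median')

# Only the missing entries are replaced; known ages are left untouched
toy['Age'] = toy['Age'].fillna(group_median)
print(toy)
```

The missing male third-class age becomes the median of that group's known ages, and likewise for the female first-class row.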

**2-1-2 Embarked**

Only two passengers in this dataset have a NaN Embarked value. They share the same Cabin and Ticket, so they probably travelled together and boarded at the same port. Grouping the data by Embarked and Sex shows that most female passengers boarded at Southampton, so the NaN values are filled with S.

In [None]:
All_data[All_data['Embarked'].isnull()]

In [None]:
All_data.groupby(['Embarked','Sex'])['Sex'].count()

In [None]:
All_data['Embarked'] = All_data['Embarked'].fillna('S')
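
The reasoning above can be sketched on a made-up frame (`toy`, not the competition data): count the passengers per port within each sex, then fill a missing port with the most common one for that sex. The name `top_port` is only illustrative.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    'Sex': ['female', 'female', 'female', 'female', 'male', 'male'],
    'Embarked': ['S', 'S', 'C', np.nan, 'S', 'Q'],
})

# Rows are sexes, columns are ports; rows with a NaN port are left out by crosstab
counts = pd.crosstab(toy['Sex'], toy['Embarked'])
print(counts)

# Most frequent port among female passengers
top_port = counts.loc['female'].idxmax()
toy['Embarked'] = toy['Embarked'].fillna(top_port)
```

Deriving the fill value from the counts avoids hard-coding the port letter.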

**2-1-3 Fare**

In [None]:
All_data[All_data['Fare'].isnull()]

In [None]:
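# Mean fare per passenger class; the fill uses the third-class mean,
# which assumes the row shown above is a third-class passenger
Class = All_data.groupby(['Pclass'])['Fare'].mean()
All_data['Fare'] = All_data['Fare'].fillna(Class[3])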

**2-1-4 Cabin**

In [None]:
# Keep the first letter of Cabin (the deck) and use 0 where Cabin is missing
check_nan = All_data['Cabin'].isnull()
All_data['newCabin'] = np.where(~check_nan, All_data['Cabin'].astype(str).str[0], 0)
All_data['newCabin']

In this notebook, I mainly filled the NaN values of the features. The only derived column is newCabin, and the model below does not use it. Since feature engineering is very effective at improving model accuracy, I will pay more attention to it in future attempts.

**2-2 Correlation**

In [None]:
All_data.head()

In [None]:
All_data.drop(['Ticket','Cabin','Name'], axis=1, inplace=True)

In [None]:
All_data['Sex'] = pd.factorize(All_data['Sex'])[0]
All_data['Embarked'] = pd.factorize(All_data['Embarked'])[0]
All_data['newCabin'] = pd.factorize(All_data['newCabin'])[0]

In [None]:
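# Split back into the original train and test rows; copy() avoids modifying a slice of All_data
Train = All_data.head(len(Train_data)).copy()
Test = All_data.tail(len(Test_data)).copy()
Test = Test.drop(['Survived'], axis=1)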

In [None]:
plt.figure(figsize=(12,10))
cor_Train = Train.corr()
sns.heatmap(cor_Train, annot=True, cmap=plt.cm.Reds)
plt.show()

In [None]:
plt.figure(figsize=(12,10))
cor_Test = Test.corr()
sns.heatmap(cor_Test, annot=True, cmap=plt.cm.Reds)
plt.show()

# **3 Model**

In [None]:
new_Features = ["Pclass", "Sex", "Fare", "SibSp", "Parch", "Age","Embarked"]

X = Train[new_Features]
y = Train['Survived']
#X_train, X_Val, y_train, y_Val = train_test_split(X, y, test_size=0.33, random_state=1)
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('Extra', ExtraTreesClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))

results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10)
    cv_results = cross_val_score(model, X, y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    mods = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(mods)
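
`MinMaxScaler` is imported at the top of the notebook but never applied. Distance-based models such as KNN and SVC are sensitive to feature scale, so one option is to wrap them in a pipeline that rescales the features inside each fold. A minimal sketch, assuming synthetic data from `make_classification` as a stand-in for `X` and `y` (`X_demo` and `y_demo` are illustrative names); the scores it prints say nothing about the Titanic data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training features and labels
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

# The scaler is refit on the training part of every fold, which avoids
# leaking information from the held-out part
pipeline = make_pipeline(MinMaxScaler(), KNeighborsClassifier())
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(pipeline, X_demo, y_demo, cv=kfold, scoring='accuracy')
print("KNN (scaled): %f (%f)" % (scores.mean(), scores.std()))
```

Whether scaling helps on the actual features would need to be checked by running the same comparison on `X` and `y`.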

In [None]:
X = pd.get_dummies(Train[new_Features])
X_Test = pd.get_dummies(Test[new_Features])

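# Note: this fits a linear regression on the 0/1 target and thresholds its
# continuous output, instead of using one of the classifiers compared above
model = LinearRegression()
model.fit(X, y)
# Predictions of at least 0.55 count as survived (1), everything else as 0
predictions = (model.predict(X_Test) >= 0.55).astype('int')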

output = pd.DataFrame({'PassengerId': Test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")