# Titanic Problem
The following is a basic attempt at solving the [Titanic](https://www.kaggle.com/c/titanic) problem on Kaggle.
The code performs only basic data cleaning and analysis.

The processed data is then used to train a Logistic Regression model from sklearn.



In [None]:
# file access
import os

# linear algebra
import numpy as np
 
# data processing
import pandas as pd
 
# data visualization
import matplotlib.pyplot as plt
import seaborn as sns
 
# Algorithms
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import AdaBoostClassifier

## Obtaining data from csv

In [None]:
InitialTraindata = pd.read_csv("../input/train.csv")
InitialTraindata.info()

## Data Processing

From the above output, we see that some cells in the 'Age' and 'Cabin' columns are missing.
To handle the missing data:

* Remove the Cabin column entirely, as most of its values are missing
* Fill the empty cells of Age with 23.7, used here as the mean of the remaining values
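
A hard-coded fill value can drift away from the data, so one option is to compute the mean from the column itself. The following is a minimal sketch on a small made-up frame; the values in `toy` are illustrative assumptions and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the Titanic training frame (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 38.0, np.nan, 30.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
})

# Share of missing values per column, to help decide which columns to drop
missing_share = toy.isna().mean()

# Drop the mostly empty column and fill Age with the mean of the observed ages
toy = toy.drop(columns="Cabin")
age_mean = toy["Age"].mean()  # NaN values are skipped by default
toy["Age"] = toy["Age"].fillna(age_mean)
```

Computing the fill value this way keeps the prose and the code in agreement if the input file changes.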



In [None]:
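# Cabin is mostly empty, so drop the column
InitialTraindata = InitialTraindata.drop(columns='Cabin')
InitialTraindata.info()
# Summary statistics of Age before filling the missing values
print(InitialTraindata['Age'].describe())
# Fill missing ages with the value stated above
InitialTraindata['Age'] = InitialTraindata['Age'].fillna(23.7)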

## Data Analysis

Different attributes of the passengers are compared with respect to their chances of survival.
This exercise helps us decide which features to include and which to ignore.

We use libraries like seaborn and matplotlib to understand the relations in the data.
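
Alongside plots, the same relations can be summarized numerically as a survival rate per group. This is a minimal sketch on a made-up frame that only borrows the column names from train.csv; the values are assumptions for illustration.

```python
import pandas as pd

# Toy data with the same column names as train.csv (values are made up)
toy = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 0, 1],
    "Sex": ["female", "male", "female", "male", "male", "male"],
    "Pclass": [1, 3, 2, 3, 1, 1],
})

# The mean of a 0/1 column is the fraction of survivors in each group
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()
```

A rate is easier to compare across groups of different sizes than a raw count.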

In [None]:
sns.set_style('whitegrid')
sns.countplot(x='Survived', hue='Sex', data= InitialTraindata,palette='RdBu_r')


In [None]:
sns.countplot(x='Survived', hue='Pclass', data= InitialTraindata,palette='RdBu_r')


In [None]:
InitialTraindata['Fare'].hist(color='green',bins=40,figsize=(8,4))

In [None]:
survived = InitialTraindata[InitialTraindata['Survived']==1]

# Print both counts; a notebook cell only displays its last expression
print(survived.Sex.value_counts())
print(survived.Pclass.value_counts())

## Creating Test and Train sets

After exploring how the above features relate to a passenger's survival, I concluded that survival rates vary significantly with 4 main features:
* Age
* Sex
* Ticket Class
* Fare

So, we create a training dataset containing only the relevant columns listed above.

**NOTE:** I've only done a surface-level analysis of the provided data for now; there is considerable scope for improving accuracy with a more systematic analysis.

In [1]:
# Keep only the selected features and the target column
train = InitialTraindata[['Age', 'Sex', 'Pclass', 'Survived', 'Fare']].copy()

# Logistic regression needs numeric inputs, so encode the Sex strings as 0/1
train['Sex'] = train['Sex'].map({'female': 0, 'male': 1})



In [None]:
Train_x, Test_x, Train_y, Test_y = train_test_split(train.drop(columns='Survived'), train['Survived'], test_size = 0.2, random_state = 0)
Train_x.info()

## Training Model

We create a simple LogisticRegression model using the sklearn Python library.
The training and testing sets created above are used to fit the model and to check its accuracy.
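
To make the two accuracy figures concrete, here is a minimal sketch on synthetic data (not the Titanic data; the array `X` and the labelling rule are assumptions for illustration). It shows that a classifier's `score` method reports mean accuracy, the same metric that `accuracy_score` computes from predictions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic two-feature data: the label depends only on the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(solver="lbfgs")
clf.fit(X_train, y_train)

# Accuracy on data the model has seen during fitting
train_acc = clf.score(X_train, y_train)
# Accuracy on the held-out split, computed from explicit predictions
test_acc = accuracy_score(y_test, clf.predict(X_test))
```

A training accuracy that is much higher than the held-out accuracy would suggest overfitting.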

In [None]:
# Fit a logistic regression model from sklearn

model = LogisticRegression(solver='lbfgs')
model.fit(Train_x,Train_y)
prediction = model.predict(Test_x)
# Accuracy on the training split, then on the held-out test split
print(model.score(Train_x, Train_y))
print(accuracy_score(Test_y, prediction))


## Processing Test data

We predict the output for the data in the official test.csv using the trained model.
The predictions then have to be arranged in the required submission format before the file is saved.
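
The cleaning applied to test.csv has to mirror the cleaning applied to the training data, including the column order. One way to keep the two in step is a shared helper. The sketch below assumes a hypothetical function `prepare_features` and a made-up frame `toy`; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd

# Same columns, in the same order, as the features used for training
FEATURES = ["Age", "Sex", "Pclass", "Fare"]

def prepare_features(df, age_fill, fare_fill):
    """Hypothetical helper: return the cleaned feature columns of df."""
    out = df[FEATURES].copy()
    out["Age"] = out["Age"].fillna(age_fill)
    out["Fare"] = out["Fare"].fillna(fare_fill)
    # Encode the Sex strings as 0/1, matching the training encoding
    out["Sex"] = out["Sex"].map({"female": 0, "male": 1})
    return out

# Toy frame standing in for test.csv (values are made up)
toy = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Age": [34.5, np.nan, 62.0],
    "Sex": ["male", "female", "male"],
    "Pclass": [3, 3, 2],
    "Fare": [7.8, np.nan, 9.7],
})
features = prepare_features(toy, age_fill=23.7, fare_fill=toy["Fare"].mean())
```

Calling the same helper on both the training and evaluation frames avoids the two preprocessing paths drifting apart.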

In [None]:
evalset = pd.read_csv('../input/test.csv')
# Apply the same cleaning and encoding as for the training data
evalset['Age'] = evalset['Age'].fillna(23.7)
evalset['Fare'] = evalset['Fare'].fillna(evalset['Fare'].mean())
evalset['Sex'] = evalset['Sex'].map({'female': 0, 'male': 1})
# Select the features in the same order as the training columns
testset = evalset[['Age', 'Sex', 'Pclass', 'Fare']]
final = model.predict(testset)

In [None]:
# One row per passenger: the PassengerId and the predicted Survived label
opcsv = pd.DataFrame({'PassengerId': evalset['PassengerId'], 'Survived': final})

opcsv.to_csv('Submission.csv',index=False)
