In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed


import numpy as np # linear algebra
import pandas as pd # data processing

# data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

Here, we will be working with the famous [Titanic dataset from Kaggle](https://www.kaggle.com/c/titanic). We’ll try to predict a binary class: survived or deceased. We’ll use a “semi-cleaned” version of the Titanic dataset, split into two separate files for training and testing. The training data is available [here](https://github.com/meetnandu05/LogisticRegression/blob/master/titanic_train.csv) and the test data [here](https://github.com/meetnandu05/LogisticRegression/blob/master/titanic_test.csv).

## Get the Data

First, let’s import the data and take a look at it.

In [1]:
train = pd.read_csv('/kaggle/input/titanic_train.csv')
test = pd.read_csv('/kaggle/input/titanic_test.csv')

train.head()

## Exploratory Data Analysis

Let’s begin some exploratory data analysis. 

We’ll start by checking out missing data. We can use seaborn to create a simple heatmap to see where we are missing data.

In [1]:
train.isnull().sum()

In [1]:
ax = sns.heatmap(train.isnull())

In [1]:
ax = sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')

Roughly 20 percent of the Age data is missing. The proportion of Age missing is likely small enough for reasonable replacement with some form of imputation. 
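
As a rough sketch of how a figure like this can be checked numerically, the mean of the boolean null mask gives the fraction of missing values per column. The toy frame below uses made-up values, not the actual Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the training data (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# isnull() yields booleans; their column-wise mean is the fraction missing.
missing_fraction = toy.isnull().mean()
print(missing_fraction)
```

Running the same expression on `train` should show which columns are mostly complete and which, like Cabin, are mostly empty.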

Looking at the Cabin column, too much of the data is missing to do anything useful with it at a basic level. We’ll probably drop it or turn it into another feature such as “Cabin Known: 1 or 0”.
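
A “Cabin Known” feature of this kind could be built from the null mask alone. This is only a sketch on toy values; the column name `CabinKnown` is a hypothetical choice and is not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame (made-up values); None marks a missing cabin.
toy = pd.DataFrame({"Cabin": ["C85", None, "E46", None]})

# 1 where a cabin is recorded, 0 where it is missing.
toy["CabinKnown"] = toy["Cabin"].notnull().astype(int)
print(toy)
```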

Let’s continue on by visualizing some more of the data. 

Let’s show the count of people who survived.

In [1]:
sns.countplot(data=train, x='Survived')

Let’s show the survival counts split by sex.

In [1]:
sns.countplot(data=train, x= 'Survived', hue='Sex')

## Data Cleaning

We want to fill in the missing Age values instead of dropping the rows that lack them.

One way to do this is to fill in the mean age of all the passengers (imputation). However, we can be smarter about this and look at the typical age by passenger class.

Let’s look at the age distribution within each passenger class.

In [1]:
sns.boxplot(data=train, x='Pclass', y='Age')

We can see that the wealthier passengers in the higher classes tend to be older, which makes sense. We’ll use the median age of each class to impute the missing Age values.
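
To make the idea concrete, here is a small sketch of class-wise median imputation on toy values (not the Titanic data). Each missing age is replaced by the median age of that row's own class.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values): one missing Age in each class.
toy = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Age": [38.0, 54.0, np.nan, 22.0, 26.0, np.nan],
})

# transform("median") returns, for every row, the median Age of its Pclass.
class_median = toy.groupby("Pclass")["Age"].transform("median")

# Only the missing entries are replaced; known ages are left untouched.
toy["Age"] = toy["Age"].fillna(class_median)
print(toy)
```

Here the first-class gap is filled with 46.0 and the third-class gap with 24.0, whereas a single global value would have given both rows the same age.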

In [1]:
train_age_nill = train[train.Age.isnull()].PassengerId.values

In [1]:
train_age_nill

In [1]:
# Assign the result back; inplace=True on train.Age may act on a temporary copy
train['Age'] = train['Age'].fillna(train.groupby('Pclass')['Age'].transform('median'))

In [1]:
train[train.PassengerId.isin(train_age_nill)].head()

In [1]:
test_age_nill = test[test.Age.isnull()].PassengerId.values
test_age_nill

In [1]:
pclass_median = train.groupby('Pclass').Age.median().to_dict()
pclass_median


In [1]:

def age_map(x):
    """Return the row's Age, or the training-set median for its Pclass if Age is missing."""
    age = x['Age']
    pclass = x['Pclass']
    if pd.isnull(age):
        return pclass_median[pclass]
    else:
        return age

# Alternatives
#test['Age'] = test[['Age', 'Pclass']].apply(lambda x: age_map(x), axis = 1)

#test['Age'] = test[['Age','Pclass']].apply(age_map,axis=1)

#test.loc[test.Age.isnull(),'Age'] = test.Pclass.map(pclass_median)

# Fill missing test ages with the training-set median for each Pclass
test['Age'] = test['Age'].fillna(test['Pclass'].map(pclass_median))

In [1]:
test[test.PassengerId.isin(test_age_nill)].head(10)

Now let’s check that heat map again.

In [1]:
ax = sns.heatmap(train.isnull(), yticklabels=False, cbar=False, cmap='viridis')

We can now see there are no null values in the Age column.

Let’s go ahead and drop the Cabin column. Any row where Embarked is NaN is kept; its Embarked dummy columns will simply all be 0 after encoding.

In [1]:
train.drop('Cabin', axis=1, inplace=True)
test.drop('Cabin', axis=1, inplace=True)

## Converting Categorical Features

We’ll need to convert categorical features to dummy variables. Otherwise the machine learning algorithm won’t be able to directly take in those features as inputs.
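
As a rough illustration of dummy encoding, the sketch below uses toy values rather than the Titanic data. It also shows one reason to encode training and test rows together: a category that appears in only one of them would otherwise produce mismatched columns.

```python
import pandas as pd

# Toy frames (made-up values); "Q" only appears in the test rows.
toy_train = pd.DataFrame({"Sex": ["male", "female"], "Embarked": ["S", "C"]})
toy_test = pd.DataFrame({"Sex": ["female", "male"], "Embarked": ["Q", "S"]})

# Encoding the training rows alone misses the "Q" category.
train_only_cols = pd.get_dummies(toy_train, drop_first=True).columns.tolist()

# Encoding both together gives one consistent set of columns.
combined = pd.concat([toy_train, toy_test], axis=0)
encoded = pd.get_dummies(combined, drop_first=True)
combined_cols = encoded.columns.tolist()

print(train_only_cols)
print(combined_cols)
```

With `drop_first=True`, the first level of each feature is dropped because it is implied when all the remaining dummies are 0.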

In [1]:
train.info()

As we can see, there are 4 categorical columns: Name, Sex, Ticket and Embarked. Of these, Name and Ticket are free-text values that are hard to relate to survival without further feature engineering.

So we drop these 2 columns and convert the other two into numerical values. Then the data will be ready for the model.

In [1]:
# Drop the free-text columns from both sets
train.drop(['Name', 'Ticket'], axis=1, inplace=True)
test.drop(['Name', 'Ticket'], axis=1, inplace=True)

# Encode train and test together so both end up with the same dummy columns
train_objs_num = len(train)
dataset = pd.concat(objs=[train, test], axis=0, sort=False)

dataset_preprocessed = pd.get_dummies(dataset, drop_first=True)

# Split back by position; copy so later edits do not act on slices of dataset_preprocessed
train = dataset_preprocessed[:train_objs_num].copy()
test = dataset_preprocessed[train_objs_num:].copy()
train.head()


In [1]:
train.info()

In [1]:
fig, ax = plt.subplots(figsize=(10,10))  
sns.heatmap(train.corr(), annot=True, cmap='viridis', ax=ax)

In [1]:
from sklearn.decomposition import PCA
pca =  PCA(n_components=1)

# Fit PCA on the training data to combine Fare and Pclass into one component
df_1 = train[['Fare','Pclass']]
col_1 = pca.fit_transform(df_1)

# Compute the median Fare for each Pclass
pclass_fare = train.groupby('Pclass').Fare.median().to_dict()
pclass_fare

# Fill missing Fare values in test with the median Fare of the matching Pclass
test['Fare'] = test['Fare'].fillna(test['Pclass'].map(pclass_fare))

# Remove the target column from the test set
test = test.drop(['Survived'], axis=1)

df_2 = test[['Fare', 'Pclass']]
col_2 = pca.transform(df_2)

train.insert(2,'Fare_Pclass', col_1[:,0], True)
test.insert(2,'Fare_Pclass', col_2[:,0], True)

train=train.drop(['Fare','Pclass'], axis=1)
test=test.drop(['Fare','Pclass'], axis=1)


## Logistic Regression

Logistic regression is a statistical method for predicting binary classes. The outcome, or target variable, is dichotomous, meaning there are only two possible classes.

Real-life examples of binary classification include categorizing an email as spam or not spam, a tumor as malignant or benign, and a transaction as fraudulent or genuine. The answer to each of these problems is categorical (yes or no), which is why they are two-class classification problems.
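
Under the hood, logistic regression turns a weighted sum of the features into a probability with the sigmoid function and then applies a threshold. The sketch below shows that step in isolation; the weights, bias and feature values are made up for illustration and are not fitted to any data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters and inputs, chosen only for illustration.
weights = np.array([0.8, -1.2])
bias = 0.1
features = np.array([[1.0, 0.5],
                     [-1.0, 2.0]])

# Linear score per row, then squashed into a probability of class 1.
scores = features @ weights + bias
probabilities = sigmoid(scores)

# Rows with probability of at least 0.5 are assigned to class 1.
predictions = (probabilities >= 0.5).astype(int)
print(probabilities, predictions)
```

The first row gets a positive score and is assigned class 1; the second gets a strongly negative score and is assigned class 0.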

## Building a Logistic Regression model

Let’s start by splitting the data into a training set and a test set.

NOTE: There is also a separate test file, so we could use all of this data for training if we wanted to.

In [1]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test =  train_test_split(train.drop('Survived', axis = 1), 
                                                     train.Survived, 
                                                     test_size = 0.20, random_state = 0)

Now let’s train the model and use it to predict.
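
The features are standardized before fitting. A small sketch on toy numbers shows the pattern of learning the mean and standard deviation from the training split only and then reusing them on the test split, so that no test-set statistics influence the model.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature splits (made-up values).
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[20.0], [40.0]])

scaler = StandardScaler()

# fit_transform learns mean and standard deviation from the training split.
X_train_scaled = scaler.fit_transform(X_train_toy)

# transform reuses those training statistics on the test split.
X_test_scaled = scaler.transform(X_test_toy)

print(scaler.mean_, X_test_scaled.ravel())
```

A test value equal to the training mean maps to 0, while values outside the training range can fall well outside the usual spread of the scaled training data.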

In [1]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

standard_scaler = StandardScaler()

X_train_std = standard_scaler.fit_transform(X_train)
X_test_std = standard_scaler.transform(X_test)




In [1]:
from sklearn.metrics import classification_report
from sklearn import metrics

# L2-regularized logistic regression; a smaller C means stronger regularization
lr = LogisticRegression(solver = 'lbfgs', penalty='l2', C=0.1)
lr.fit(X_train_std, y_train)

y_predict = lr.predict(X_test_std)

print(classification_report(y_test,y_predict))
print("Accuracy:", metrics.accuracy_score(y_test, y_predict))