<a href="https://colab.research.google.com/github/ramprasadlg1/Data/blob/main/LogisticRegression_Titanic_Dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Logistic Regression**
**Logistic regression is a supervised learning algorithm for classification. It estimates the probability that a binary (yes/no) event occurs.**
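
As a minimal sketch of the idea, the logistic (sigmoid) function turns a linear score into a probability between 0 and 1, which is then thresholded to get a yes/no label. The coefficients `b0` and `b1` below are hypothetical values chosen for illustration; they are not fitted to the Titanic data.

```python
import math

def sigmoid(z: float) -> float:
    """Map a real-valued score z to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for a one-feature model: p = sigmoid(b0 + b1 * x)
b0, b1 = -1.0, 2.0

for x in (0.0, 0.5, 1.0):
    p = sigmoid(b0 + b1 * x)
    label = int(p >= 0.5)  # threshold the probability at 0.5
    print(f"x={x:.1f}  p={p:.3f}  predicted class={label}")
```

A score of zero maps to a probability of exactly 0.5, so the sign of the linear score decides the predicted class.
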
##**Titanic Dataset**


Import the required packages


In [1]:
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import warnings
warnings.filterwarnings("ignore")




Read the data set

In [2]:
data=pd.read_csv("https://raw.githubusercontent.com/insaid2018/Term-1/master/Data/Casestudy/titanic_train.csv")

Understand the data using shape

In [3]:
data.shape

(891, 12)

In [4]:
data.head()


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Handling the missing values

In [6]:
median_age=data.Age.median()
#mean_age=data.Age.mean()
#mean_age
median_age


28.0

Fill missing values

In [7]:
data['Embarked'].mode()
data['Embarked'].mode()[0]


'S'

In [8]:
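# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update the DataFrame in recent pandas versions.
data['Age'] = data['Age'].fillna(median_age)
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])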
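
The same imputation pattern can be tried on a small frame to see what it does. This is only a sketch: the `toy` DataFrame is made up for illustration and is not part of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing age and one missing port
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": ["S", "C", None, "S"],
})

# Numeric column: fill with the median of the observed values
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Categorical column: fill with the most frequent value (the mode)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

print(toy)
print(toy.isna().sum())
```

The median is less sensitive to extreme ages than the mean, which is one common reason to prefer it for a skewed numeric column.
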

In [9]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     891 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Mode of Embarked

In [10]:
data['Embarked'].mode()


0    S
Name: Embarked, dtype: object

Drop columns that will not be used: PassengerId is only an identifier, and Cabin is mostly missing (204 of 891 values)

In [11]:
data.drop(['PassengerId','Cabin'], axis=1, inplace=True)

In [12]:
data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S


In [13]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       891 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(4)
memory usage: 69.7+ KB


##Feature Engineering
We need a new column that holds the size of each passenger's family.
We compute it as SibSp + Parch + 1 (the 1 counts the passenger themselves).

In [14]:
data['familySize']=data['SibSp']+data['Parch']+1
data.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,familySize
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,2
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,2
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,1
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,2
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,1


#Exploratory Data Analysis

What is the frequency and proportion of survivors?

In [15]:
data.Survived.value_counts()

survived_percentage = (data['Survived'].value_counts()[1] / data['Survived'].value_counts().sum())*100
survived_percentage

38.38383838383838
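
Because the target is coded as 0/1, the same proportion can also be read off with `value_counts(normalize=True)` or with the column mean. A short sketch on made-up labels (the `survived` series below is hypothetical, not the Titanic column):

```python
import pandas as pd

# Toy 0/1 labels (hypothetical), standing in for a Survived column
survived = pd.Series([0, 1, 0, 1, 0], name="Survived")

counts = survived.value_counts()                      # frequency of each class
shares = survived.value_counts(normalize=True) * 100  # percentage of each class
rate = survived.mean() * 100                          # mean of 0/1 labels = share of 1s

print(counts.to_dict())
print(shares.to_dict())
print(rate)
```
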

Drop unwanted columns
We don't need Name or Ticket, and SibSp and Parch are already captured by familySize


In [16]:
data=data.drop(['Name','Ticket','SibSp','Parch'], axis = 1)


The final columns after dropping the unwanted ones

In [17]:
data.head()

Unnamed: 0,Survived,Pclass,Sex,Age,Fare,Embarked,familySize
0,0,3,male,22.0,7.25,S,2
1,1,1,female,38.0,71.2833,C,2
2,1,3,female,26.0,7.925,S,1
3,1,1,female,35.0,53.1,S,2
4,0,3,male,35.0,8.05,S,1


In [18]:
data.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Survived    891 non-null    int64  
 1   Pclass      891 non-null    int64  
 2   Sex         891 non-null    object 
 3   Age         891 non-null    float64
 4   Fare        891 non-null    float64
 5   Embarked    891 non-null    object 
 6   familySize  891 non-null    int64  
dtypes: float64(2), int64(3), object(2)
memory usage: 48.9+ KB

In [19]:
data.dtypes

Survived        int64
Pclass          int64
Sex            object
Age           float64
Fare          float64
Embarked       object
familySize      int64
dtype: object

Convert the data type of columns
Pclass is a categorical feature stored as integers, so we convert it to the object data type so that it is treated as categorical


In [20]:
data['Pclass'] = data.Pclass.astype(dtype=object)

In [21]:
data.dtypes

Survived        int64
Pclass         object
Sex            object
Age           float64
Fare          float64
Embarked       object
familySize      int64
dtype: object

Create the dummy variables for categorical variables
Sex, Pclass and Embarked are categorical variables

*get_dummies* creates one dummy (indicator) column for each value of a categorical variable

*drop_first* drops the first level of each variable, so a variable with n values yields n-1 columns and the dropped level acts as the baseline


In [22]:
encoded_data= pd.get_dummies(data, drop_first=True)
encoded_data.head()

Unnamed: 0,Survived,Age,Fare,familySize,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,0,22.0,7.25,2,0,1,1,0,1
1,1,38.0,71.2833,2,0,0,0,0,0
2,1,26.0,7.925,1,0,1,0,0,1
3,1,35.0,53.1,2,0,0,0,0,1
4,0,35.0,8.05,1,0,1,1,0,1
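
The effect of *drop_first* is easiest to see on a single column. The sketch below uses a made-up `toy` frame rather than the Titanic data. Note that recent pandas versions return boolean dummy columns by default; passing `dtype=int` (used here) gives the 0/1 values shown in the output above.

```python
import pandas as pd

# Toy column (hypothetical values) with three distinct categories
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

full = pd.get_dummies(toy, dtype=int)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True, dtype=int)  # first category dropped

print(list(full.columns))
print(list(reduced.columns))
print(reduced)
```

In the reduced frame, a row with zeros in every dummy column belongs to the dropped category, so no information is lost.
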


Value count of Embarked


In [24]:
data.Embarked.value_counts()

S    646
C    168
Q     77
Name: Embarked, dtype: int64

#Data Preparation
We first separate the features (X) from the target (y), and then split the data into training and test sets for model development

In [25]:
X=encoded_data.drop('Survived', axis = 1)
y = encoded_data['Survived']


In [26]:
X.head()

Unnamed: 0,Age,Fare,familySize,Pclass_2,Pclass_3,Sex_male,Embarked_Q,Embarked_S
0,22.0,7.25,2,0,1,1,0,1
1,38.0,71.2833,2,0,0,0,0,0
2,26.0,7.925,1,0,1,0,0,1
3,35.0,53.1,2,0,0,0,0,1
4,35.0,8.05,1,0,1,1,0,1


We split the data into training and test sets
Use a fixed integer for random_state so that the split is reproducible

In [27]:
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=0.20, random_state = 42)

print('X_train shape: ',X_train.shape)
print('y_train shape: ',y_train.shape)

print('X_test shape: ',X_test.shape)
print('y_test shape: ',y_test.shape)

X_train shape:  (712, 8)
y_train shape:  (712,)
X_test shape:  (179, 8)
y_test shape:  (179,)
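
The role of `random_state` can be checked with a small sketch on synthetic arrays (`X_toy` and `y_toy` are made up for illustration): calling `train_test_split` twice with the same seed should give the same split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples, 2 features, alternating 0/1 labels
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# Two calls with the same random_state
a_train, a_test, ya_train, ya_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42)

print(a_train.shape, a_test.shape)
print(np.array_equal(a_test, b_test))
```

With a fractional `test_size`, the number of test rows is rounded up, which is why 20% of 891 rows gives 179 test rows and 712 training rows above.
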


#Model development and evaluation
We will fit a logistic regression model on the input features and tune it as needed

We will analyze the results and record our observations



##Logistic regression


In [28]:
logreg=LogisticRegression()
logreg.fit(X_train, y_train)

y_pred_test = logreg.predict(X_test)
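
`predict` returns hard 0/1 labels, but a fitted `LogisticRegression` also exposes the underlying probabilities through `predict_proba`. The sketch below uses synthetic data (`X_toy`, `y_toy` are made up) to show that, for a binary problem, the labels correspond to thresholding the class-1 probability at 0.5.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic binary problem (hypothetical data, not the Titanic features)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 2))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_toy, y_toy)

proba = model.predict_proba(X_toy)[:, 1]  # estimated P(class = 1) per row
labels = model.predict(X_toy)             # hard 0/1 predictions
manual = (proba > 0.5).astype(int)        # thresholding the probability by hand

print(proba[:5])
print(np.array_equal(labels, manual))
```

Working with probabilities makes it possible to pick a threshold other than 0.5 when precision and recall need to be traded off differently.
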

In [30]:
acc = accuracy_score(y_test, y_pred_test)
print(acc)

prec = precision_score(y_test, y_pred_test)
print(prec)

rec = recall_score(y_test, y_pred_test)
print(rec)

f1 = f1_score(y_test, y_pred_test)
print(f1)


0.8044692737430168
0.7910447761194029
0.7162162162162162
0.7517730496453902
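
The four scores printed above are, in order, accuracy, precision, recall and F1. All of them can be derived from the confusion matrix, which this notebook imports but does not use. A minimal sketch on made-up labels (`y_true` and `y_hat` are hypothetical, not the Titanic test set):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy labels (hypothetical)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of correct predictions
precision = tp / (tp + fp)                  # predicted positives that are correct
recall = tp / (tp + fn)                     # actual positives that are found
f1_manual = 2 * precision * recall / (precision + recall)

print(tn, fp, fn, tp)
print(accuracy, precision, recall, f1_manual)
print(accuracy_score(y_true, y_hat), precision_score(y_true, y_hat),
      recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```

By the same formulas, the scores reported above are consistent with 53 true positives, 14 false positives, 21 false negatives and 91 true negatives on the 179 test rows; this is an inference from the printed values, not an output of the notebook.
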
