## Titanic Dataset

- The Titanic dataset is a collection of data that contains information about the passengers who were on board the Titanic when it sank in 1912.
- It includes details such as the passengers' age, gender, ticket class, and whether or not they survived the disaster.
- The dataset is often used in data analysis and machine learning to explore patterns and relationships in the data.
- It can also be used to develop predictive models based on the passengers' characteristics and their likelihood of survival.

## Column Details

- PassengerId: a unique identifier for each passenger (this column does not appear in the summary statistics of the CSV file used below)
- Survived: indicates whether the passenger survived the sinking (0 = did not survive, 1 = survived)
- Pclass: the passenger's ticket class (1 = first class, 2 = second class, 3 = third class)
- Name: the passenger's name
- Sex: the passenger's gender
- Age: the passenger's age in years
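- SibSp: the number of siblings or spouses the passenger had aboard the ship (named `Siblings/Spouses Aboard` in the CSV file used below)
- Parch: the number of parents or children the passenger had aboard the ship (named `Parents/Children Aboard` in the CSV file used below)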
- Ticket: the passenger's ticket number
- Fare: the fare the passenger paid for their ticket
- Cabin: the passenger's cabin number
- Embarked: the port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

## Read the dataset and look at basic statistics
Source: https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

In [6]:
import pandas as pd
titanic = pd.read_csv('https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv')
titanic.describe()

| | Survived | Pclass | Age | Siblings/Spouses Aboard | Parents/Children Aboard | Fare |
|---|---|---|---|---|---|---|
| count | 887.0 | 887.0 | 887.0 | 887.0 | 887.0 | 887.0 |
| mean | 0.385569 | 2.305524 | 29.471443 | 0.525366 | 0.383315 | 32.30542 |
| std | 0.487004 | 0.836662 | 14.121908 | 1.104669 | 0.807466 | 49.78204 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 0.0 | 2.0 | 20.25 | 0.0 | 0.0 | 7.925 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.1375 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |


## Determine the count of surviving and non-surviving passengers

In [7]:
# Count each value of the Survived column once (1 = survived, 0 = did not survive)
counts = titanic['Survived'].value_counts()
survived = counts[1]
not_survived = counts[0]
print("survived :", survived)
print("not_survived :", not_survived)
print("Total Passengers :", survived + not_survived)

survived : 342
not_survived : 545
Total Passengers : 887


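## Identify columns with low correlation to the target variable (in this case, the "Survived" column)

The cell below displays the absolute value of the correlation coefficient between each numeric column and the "Survived" column, in descending order. A column with a low coefficient has only a weak linear relationship with the target and is a candidate for removal, although a low coefficient does not rule out a non-linear effect. Text columns such as Sex are not part of this calculation unless they are first converted to numbers.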

In [8]:
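# numeric_only=True skips text columns such as Name and Sex (needed in pandas 2.0 and later)
corr_matrix = titanic.corr(numeric_only=True)
abs(corr_matrix['Survived']).sort_values(ascending=False)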

Survived                   1.000000
Pclass                     0.336528
Fare                       0.256179
Parents/Children Aboard    0.080097
Age                        0.059665
Siblings/Spouses Aboard    0.037082
Name: Survived, dtype: float64
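
The ranking above covers numeric columns only. As a minimal sketch of how a text column could be included, the example below encodes a `Sex` column as 0/1 before calling `corr`. The `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic file.

```python
import pandas as pd

# Tiny made-up sample; the values are illustrative, not taken from the dataset.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0],
})

# Encode the text column as integers so that corr() can include it.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Absolute correlation of every other column with the target, strongest first.
corr_with_target = (
    toy.corr(numeric_only=True)["Survived"]
    .drop("Survived")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

In this toy sample `Sex` matches `Survived` exactly, so it ranks first. On real data the coefficient would be lower, and it would still describe only a linear association.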

## Training a simple machine learning model 

- The logistic regression algorithm looks at the relationship between the outcome we want to predict and the factors we think might influence that outcome. It calculates a probability of the outcome based on those factors. If the probability is greater than a certain threshold, we predict one outcome, and if it's less than the threshold, we predict the other outcome.

- The "logistic" part of logistic regression refers to the mathematical function that is used to calculate the probabilities. This function makes sure that the probabilities are always between 0 and 1, which is necessary for a binary outcome.
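
The two points above can be sketched in a few lines of plain Python. The weights and bias below are hypothetical numbers chosen for illustration, not coefficients fitted to the Titanic data, and `predict_label` is an illustrative helper rather than a scikit-learn function.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_label(features, weights, bias, threshold=0.5):
    """Return (probability, label) for one sample under a linear model."""
    # Weighted sum of the features plus the bias term.
    z = bias + sum(w * x for w, x in zip(weights, features))
    p = sigmoid(z)
    # Predict the positive class when the probability reaches the threshold.
    return p, int(p >= threshold)

# Hypothetical coefficients for (Age, Sex, Pclass); not fitted to any data.
weights = [-0.03, 2.5, -1.1]
bias = 1.0

p, label = predict_label([30, 1, 1], weights, bias)
print(round(p, 3), label)
```

Training a logistic regression model amounts to choosing the weights and bias so that these probabilities agree with the observed outcomes as closely as possible.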







- Load the Titanic dataset: The Titanic dataset contains information about passengers on the Titanic, including whether they survived or not. We can load this dataset into a pandas dataframe using the read_csv function.

- Preprocess the data: We may need to preprocess the data, such as filling in missing values, converting categorical variables to numerical, and normalizing numerical variables.

- Split the data: We split the data into training and testing sets using the train_test_split function from scikit-learn. This ensures that we have a separate set of data to evaluate our model.

- Train the model: We train a logistic regression model using the training data. This involves finding the set of coefficients that best predicts whether a passenger survived, based on their features (here age, sex, and ticket class).

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

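# Preprocess: fill missing numeric values with the column mean,
# then drop any rows that still contain missing values
titanic.fillna(titanic.mean(numeric_only=True), inplace=True)
titanic.dropna(inplace=True)

# Encode the Sex column as integers (male = 0, female = 1)
titanic['Sex'] = titanic['Sex'].map({'male': 0, 'female': 1})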

X = titanic[['Age', 'Sex', 'Pclass']]
y = titanic['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression()
model.fit(X_train, y_train)

LogisticRegression()
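
The preprocessing step described earlier mentions normalizing numerical variables, which the cell above does not do. One possible way to add it is to chain scikit-learn's `StandardScaler` and `LogisticRegression` with `make_pipeline`. The sketch below uses a small made-up DataFrame (`toy`) so that it runs on its own; the values are not from the Titanic file.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up rows with the same feature names as the notebook; values are illustrative.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, 27.0, 54.0, 2.0, 14.0],
    "Sex": [0, 1, 1, 0, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 2, 1, 3, 2],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
})
X_toy = toy[["Age", "Sex", "Pclass"]]
y_toy = toy["Survived"]

# The scaler standardizes each feature to zero mean and unit variance
# before the values reach the classifier.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_toy, y_toy)

toy_predictions = scaled_model.predict(X_toy)
print(toy_predictions)
```

Placing the scaler inside the pipeline means its mean and variance are learned only from the data passed to `fit`, so test data does not influence the scaling.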

## Prediction and Evaluation

- Make predictions: We use the trained model to make predictions on the test data. The model calculates the probability that each passenger survived, based on their features, and `predict` returns the class with the higher probability.

- Evaluate the model: We evaluate the performance of the model using scikit-learn's classification report, which lists precision, recall, F1-score, and support for each class. These metrics are derived from the same counts that a confusion matrix shows: true positives, true negatives, false positives, and false negatives.

- Precision refers to the percentage of correctly identified positive cases out of all cases that were identified as positive. In other words, it measures how precise the model is in identifying positive cases.

- Recall, on the other hand, refers to the percentage of correctly identified positive cases out of all actual positive cases. In other words, it measures how comprehensive the model is in identifying positive cases.

- F1-score is the harmonic mean of precision and recall, which takes into account both measures. It provides a single value that summarizes the overall performance of the model.

- Support is the number of samples in each class, which is useful to see if the dataset is balanced or imbalanced.

In summary, precision, recall, and F1-score are important metrics for evaluating the accuracy and effectiveness of a classification model, while support provides additional information about the balance of the dataset.
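
As a worked example of these definitions, the sketch below computes precision, recall, and F1-score by hand from confusion matrix counts and compares them with scikit-learn's own functions. The label lists `y_true` and `y_hat` are made up for illustration.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up true labels and predictions for ten samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_hat = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # correct positives among predicted positives
recall = tp / (tp + fn)      # correct positives among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(tn, fp, fn, tp)
print(precision, recall, round(f1, 4))

# The hand-computed values agree with scikit-learn's implementations.
assert abs(precision - precision_score(y_true, y_hat)) < 1e-12
assert abs(recall - recall_score(y_true, y_hat)) < 1e-12
assert abs(f1 - f1_score(y_true, y_hat)) < 1e-12
```

Here 3 of the 4 predicted positives are correct (precision 0.75) and 3 of the 5 actual positives are found (recall 0.6).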

In [10]:
from sklearn.metrics import classification_report

y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))


              precision    recall  f1-score   support

           0       0.82      0.83      0.83       117
           1       0.67      0.66      0.66        61

    accuracy                           0.77       178
   macro avg       0.74      0.74      0.74       178
weighted avg       0.77      0.77      0.77       178
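
The report above is based on the class labels returned by `predict`. To look at the underlying probabilities, or to apply a different decision threshold, `predict_proba` can be used. The sketch below fits a model on a small made-up one-feature dataset so that it runs on its own. The 0.8 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up data: one feature, with the positive class at higher values.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X_toy, y_toy)

# Column 1 of predict_proba holds the probability of class 1.
proba = clf.predict_proba(X_toy)[:, 1]

default_labels = clf.predict(X_toy)           # labels as returned by predict
manual_labels = (proba >= 0.5).astype(int)    # same decision made by hand
stricter_labels = (proba >= 0.8).astype(int)  # fewer positives, higher bar

print(np.round(proba, 3))
print(default_labels, manual_labels, stricter_labels)
```

Raising the threshold generally trades recall for precision, because fewer samples are labeled positive.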

