<a href="https://colab.research.google.com/github/weidongshao/ML-tutorials/blob/main/titanic_analysis_with_chatgpt.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Exploring the Titanic Dataset with ChatGPT
This notebook walks through an analysis of the Titanic dataset in five steps: data loading, cleaning, exploratory data analysis, hypothesis testing, and modeling.



## Data Loading and Preliminary Exploration
First, we'll load the dataset and take an initial look at its structure.


In [1]:

import pandas as pd

# Load the Titanic dataset from a URL
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_data = pd.read_csv(url)

# Display the first few rows
titanic_data.head()


,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
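Beyond `head()`, a per-column summary of dtypes and missing values helps decide what the cleaning step needs to handle. The sketch below is one possible way to do that; the helper name `summarize_structure` is hypothetical, and the small `sample` frame only re-types a few columns of the five rows shown above so the example runs without a download.

```python
import pandas as pd


def summarize_structure(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype and missing-value counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(1),
    })


# Hypothetical stand-in: a few columns of the five preview rows shown above
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

summary = summarize_structure(sample)
print(summary)
```

On the full dataset the same call would be `summarize_structure(titanic_data)`.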



## Data Cleaning
In this section, we'll handle missing values, drop unnecessary columns, and encode categorical variables.


In [None]:

# Fill missing values in 'Age' with the median age.
# Assign the result back: calling fillna(inplace=True) on a column selection
# triggers chained-assignment warnings in recent pandas versions.
titanic_data['Age'] = titanic_data['Age'].fillna(titanic_data['Age'].median())

# Drop the 'Cabin' column
titanic_data = titanic_data.drop(columns='Cabin')

# Fill missing values in 'Embarked' with the most frequent embarkation port
most_frequent_embarked = titanic_data['Embarked'].mode()[0]
titanic_data['Embarked'] = titanic_data['Embarked'].fillna(most_frequent_embarked)

# One-hot encode 'Sex' and 'Embarked' columns
titanic_data_encoded = pd.get_dummies(titanic_data, columns=['Sex', 'Embarked'], drop_first=True)

titanic_data_encoded.head()
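To make the effect of these steps easier to see, here is a small sketch on a made-up five-row frame (the `toy` values are illustrative, not taken from the dataset). It shows the median and mode imputation and which dummy columns `drop_first=True` keeps: the first category of each variable in sorted order (`female`, `C`) is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Made-up rows for illustration only
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, 35.0, 35.0],
    "Sex": ["male", "female", "female", "male", "female"],
    "Embarked": ["S", "C", None, "Q", "S"],
})

# Impute by assigning the filled column back to the frame
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# One-hot encode; drop_first removes the first category of each variable
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded.columns.tolist())
```

The surviving column names (`Sex_male`, `Embarked_Q`, `Embarked_S`) are the ones used as model features later in the notebook.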



## Exploratory Data Analysis (EDA)
Now, we'll explore and visualize the dataset to uncover patterns and relationships.


In [None]:

import matplotlib.pyplot as plt
import seaborn as sns

# The full set of distribution and survival-rate plots from the analysis
# is omitted here; the age distribution is shown as one example.
sns.histplot(titanic_data_encoded['Age'], kde=True, color='blue')
plt.title('Age Distribution')
plt.show()
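Survival rates by group can also be computed directly before plotting them. The following is a minimal sketch, assuming a 0/1 `Survived` column; the helper `survival_rate_by` is a hypothetical name, and the `preview` frame re-types the five rows displayed at the top of the notebook, so its rates are not representative of the full dataset.

```python
import pandas as pd


def survival_rate_by(df: pd.DataFrame, column: str) -> pd.Series:
    """Mean of the 0/1 'Survived' flag within each group of `column`."""
    return df.groupby(column)["Survived"].mean().sort_index()


# Hypothetical stand-in: the five preview rows from the loading step
preview = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

rate_by_class = survival_rate_by(preview, "Pclass")
rate_by_sex = survival_rate_by(preview, "Sex")
print(rate_by_class)
print(rate_by_sex)
```

On the cleaned data, `survival_rate_by(titanic_data, 'Pclass').plot.bar()` would turn the same series into a bar chart.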



## Hypothesis Testing
We'll use chi-squared tests of independence to check whether survival is associated with gender and with class.


In [None]:

from scipy.stats import chi2_contingency

# Only the gender-based test is shown here; the other chi-squared tests
# from the analysis are omitted.

# Cross-tabulate gender against survival
contingency_gender = pd.crosstab(titanic_data_encoded['Sex_male'], titanic_data_encoded['Survived'])

# chi2_contingency returns (statistic, p-value, dof, expected); keep the first two
chi2_stat_gender, p_val_gender = chi2_contingency(contingency_gender)[:2]

chi2_stat_gender, p_val_gender
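The same test can be wrapped in a small helper so it can be reused for other factors such as class, with the decision against a significance level made explicit. This is a sketch under assumptions: the function name `chi2_association` and the threshold `alpha=0.05` are choices made here, and the `synthetic` counts are invented to exercise the code, not Titanic results.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def chi2_association(df, factor, outcome="Survived", alpha=0.05):
    """Chi-squared test of independence between `factor` and `outcome`."""
    table = pd.crosstab(df[factor], df[outcome])
    chi2, p_value, dof, _expected = chi2_contingency(table)
    return {
        "chi2": float(chi2),
        "p_value": float(p_value),
        "dof": int(dof),
        # Reject independence when the p-value falls below alpha
        "reject_null": bool(p_value < alpha),
    }


# Invented counts: (Pclass, Survived) -> number of passengers
counts = {(1, 1): 10, (1, 0): 2, (2, 1): 6, (2, 0): 6, (3, 1): 2, (3, 0): 10}
rows = [
    {"Pclass": pclass, "Survived": survived}
    for (pclass, survived), n in counts.items()
    for _ in range(n)
]
synthetic = pd.DataFrame(rows)

result = chi2_association(synthetic, "Pclass")
print(result)
```

With the notebook's data, the class-based test would be `chi2_association(titanic_data_encoded, 'Pclass')`. A small p-value indicates an association but says nothing about its direction or size.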



## Modeling
We'll fit a logistic regression model to predict survival from passenger class, age, family counts (SibSp, Parch), fare, sex, and port of embarkation.


In [None]:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Features and target variable
features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Sex_male', 'Embarked_Q', 'Embarked_S']
X = titanic_data_encoded[features]
y = titanic_data_encoded['Survived']

# Splitting the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Logistic regression model
logreg = LogisticRegression(max_iter=500)
logreg.fit(X_train, y_train)

# Predictions and accuracy
y_pred = logreg.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)

accuracy
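Accuracy alone hides which kind of error the model makes, so a confusion matrix is a common companion metric. The sketch below is illustrative only: it uses synthetic data from `make_classification` (an assumption made so the block runs on its own) and `_demo` variable names to avoid clashing with the notebook's variables.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data standing in for the Titanic features
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

model_demo = LogisticRegression(max_iter=500).fit(X_tr, y_tr)
y_pred_demo = model_demo.predict(X_te)

# Rows are true classes, columns are predicted classes
cm_demo = confusion_matrix(y_te, y_pred_demo, labels=[0, 1])
accuracy_demo = accuracy_score(y_te, y_pred_demo)

print(cm_demo)
print(accuracy_demo)
```

The diagonal of the matrix counts correct predictions, so accuracy equals the diagonal sum divided by the total. For the model above, the equivalent call is `confusion_matrix(y_test, y_pred)`.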
