Part 1: Load Cleaned Data + Split for Training

In [14]:
import pandas as pd

df=pd.read_csv("titanic_cleaned.csv")

#Question: How many different file formats can pandas read?


📂 Structured Data Files
Format	Function	Example Use
CSV	pd.read_csv()	Most common flat file for datasets
Excel (XLSX)	pd.read_excel()	Reports, tables from businesses
JSON	pd.read_json()	APIs or nested data (web scraping)
HTML	pd.read_html()	Tables from websites
SQL	pd.read_sql()	Pull data directly from databases
Parquet	pd.read_parquet()	Big data, fast columnar storage
Pickle	pd.read_pickle()	Python-native serialized objects
ORC	pd.read_orc()	Big data, like Parquet (less common)
Feather	pd.read_feather()	Fast I/O, good for ML pipelines
Clipboard	pd.read_clipboard()	Quick copy-paste of tabular data
Text	pd.read_table()	Text files with delimiters
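
The readers in the table share one calling pattern. As a minimal sketch, the snippet below parses the same two rows from CSV text and from JSON text. The in-memory strings and the column names are made up for illustration; with real data you would pass a file path instead.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample data held in memory instead of on disk.
csv_text = "Name,Age\nAlice,29\nBob,41\n"
json_text = '[{"Name": "Alice", "Age": 29}, {"Name": "Bob", "Age": 41}]'

# Both readers accept a file path or a file-like object.
df_csv = pd.read_csv(StringIO(csv_text))
df_json = pd.read_json(StringIO(json_text))

print(df_csv.shape)
print(df_json.shape)
```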

In [15]:
df.head()
df = df.drop('Embarked_C',axis = 1)

In [16]:
from sklearn.model_selection import train_test_split

#Define features and target
X= df.drop('Survived',axis = 1)
y = df['Survived']
print(X.shape)
print(y.shape)

#Questions: 1) do we always define features and target before creating a model? 2) do we always drop the target variable from X? 3) is there a rule of thumb
#for defining features and targets? 4) why do we use axis = 1? 5) explain more about sklearn.model_selection and train_test_split

(889, 9)
(889,)


✅ Q1: Do we always define features and target before creating a model?
✅ YES — Always.
Every supervised ML model needs:

X (features) → What the model uses to make predictions

y (target/label) → What the model is trying to predict

This is the core of supervised learning.

✅ Q2: Do we always drop the target from X?
✅ YES — Always.
If you keep y (target column like Survived) in your features:

The model “cheats” — it's like giving it the answer during training

You'll get super high accuracy — but it's completely fake

So this line is a must:

X = df.drop('Survived', axis=1)
✅ Q3: Is there a thumb rule for defining features and targets?
✅ YES — Here's the logic:
Type	Meaning	Example
Target (y)	The column you want to predict	Survived, Price, Churn, IsSpam
Features (X)	The data used to make predictions	Everything except the target

👉 You exclude columns like:

IDs

Target label

Columns you wouldn't know in real time (like future info)

✅ Q4: Why axis=1?
In pandas, axis tells drop() whether the label you pass names a row or a column:

Axis	Meaning
axis=0	The label is a row (index) label
axis=1	The label is a column name ✅

So:

df.drop('Survived', axis=1)
Means: drop the 'Survived' column, not a row.
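
A small sketch of the difference, using a made-up three-row frame in place of the Titanic data (the names toy, no_target, same_thing and no_first_row are only for illustration):

```python
import pandas as pd

# Tiny made-up frame standing in for the Titanic data.
toy = pd.DataFrame({"Survived": [0, 1, 1], "Age": [22, 38, 26], "Fare": [7.25, 71.28, 7.92]})

no_target = toy.drop("Survived", axis=1)    # drop a column by name
same_thing = toy.drop(columns="Survived")   # equivalent, more explicit spelling
no_first_row = toy.drop(0, axis=0)          # drop the row whose index label is 0

print(list(no_target.columns))   # ['Age', 'Fare']
print(no_first_row.shape)        # (2, 3)
```

Note that drop() returns a new DataFrame; toy itself is unchanged unless you assign the result back.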

✅ Q5: What is train_test_split and sklearn.model_selection?
📦 sklearn.model_selection
This is the module in Scikit-learn where all the model selection tools live:

train_test_split
cross_val_score
GridSearchCV
StratifiedKFold
RandomizedSearchCV

📌 train_test_split does exactly what it says:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
Parameter	Meaning
X, y	Your features and target
test_size=0.2	20% of the data goes into test set
random_state=42	Ensures same split every time (for reproducibility)

🎯 Why We Do This:
To train on one part of the data (X_train, y_train)

And test generalization on unseen data (X_test, y_test)

This simulates real-world usage.

In [17]:
#train_test_split
X_train,X_test,y_train,y_test = train_test_split(X , y, test_size = 0.2, random_state = 42)

#Questions: 1) what are we achieving here? 2)Why X capital y small 3)Explain parameters for train_test_split

✅ Q1: What are we achieving with train_test_split?
🎯 Purpose:
We are splitting the dataset into two parts:

Training set: Used to train the model (X_train, y_train)

Testing set: Used to evaluate how well the model performs on unseen data (X_test, y_test)

💡 Why it’s important:
If you train and test on the same data, the score cannot tell you whether the model learned general patterns or simply memorized the rows (overfitting).

✅ train_test_split simulates the real world, where you don’t get to see test data during training.
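
To make the memorization point concrete, here is a hedged sketch on synthetic data (generated with make_classification, not the Titanic set). An unrestricted decision tree scores perfectly on the rows it trained on, and the held-out rows reveal the gap.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, deliberately noisy data: flip_y=0.2 assigns a random label to about 20% of samples.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# A tree with no depth limit can memorize every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)   # accuracy on rows the model has seen
test_acc = tree.score(X_te, y_te)    # accuracy on held-out rows
print("train accuracy:", train_acc)
print("test accuracy:", test_acc)
```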

✅ Q2: Why is X capital and y small?
It’s just a convention, but here’s the meaning:

Symbol	Meaning
X (uppercase)	A matrix of features — many rows and columns
y (lowercase)	A vector (single column) — your target label

✅ This matches linear algebra conventions from math/ML.

Example:
X.shape → (800, 6)   # 800 samples, 6 features  
y.shape → (800,)     # 800 target values
✅ Q3: Explain train_test_split() Parameters
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42
)
Parameter	Purpose
X, y	Features and target
test_size=0.2	20% of the data goes to testing, 80% to training
train_size=0.8	Optional; if omitted, it is set to the complement of test_size
random_state=42	Sets the random seed → ensures the same split every time (for reproducibility)
shuffle=True (default)	Randomly shuffles data before splitting

🎯 Why random_state=42?
If you don’t set this, every run splits differently → hard to debug or compare.

By setting a number (any number, 42 is a meme 😄), you get consistent, repeatable splits.
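
A quick way to see both effects (the 80/20 sizes and the repeatable split) is a sketch on ten made-up rows; X_demo and y_demo are placeholder arrays, not the Titanic features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 feature columns, alternating labels.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)   # (8, 2) (2, 2)

# Same seed, same rows in the test set.
print(np.array_equal(X_test_a, split_b[1]))   # True
```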

🧠 TL;DR Summary:
train_test_split lets you detect overfitting by testing your model on data it has not seen

X = matrix of features, y = vector of targets

test_size=0.2 = hold out 20% for testing

random_state=42 = ensure consistent split

✅ Always split before training any model



In [18]:
#Let's train a Logistic Regression model that predicts survival; this is where machine learning actually happens
#Part 2 – Train & Evaluate Logistic Regression
#Step 1: Import Required Tools

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix

#Questions: 1) Why are we using a Logistic Regression model? 2) What is the rule of thumb for choosing a model? 3) What are metrics, how are they useful, and when should we use them?

In [19]:
#Train the model
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)

# max_iter=1000 gives the solver more iterations to converge; scikit-learn's default is 100, which can be too low.

#Questions:1)What happens when we call the fit function? 2)is max_iter 1000 in milliseconds? how does it impact?

✅ Q1: What happens when we call .fit()?

lr.fit(X_train, y_train)
🎯 .fit() = Train the model
The model looks at each row in X_train, compares its prediction to the real y_train, and adjusts itself to minimize error.

⚙️ For Logistic Regression, .fit() means:
The model starts from an initial set of weights (coefficients)

For each row, it calculates a predicted probability

It compares that to the true label (0 or 1)

Then it adjusts the weights with an iterative optimizer to reduce the overall prediction error (called loss); scikit-learn's default solver is lbfgs, which plays the same role as Gradient Descent

This cycle repeats until the model converges — meaning the changes are very small and it's "happy" with the result

✅ You don't need to do any of the math by hand, but that's the core of what's happening!
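
The cycle described above can be sketched in a few lines of NumPy. This is plain batch gradient descent written for illustration, not the solver scikit-learn actually runs; the function name fit_logistic, the learning_rate and tol parameters, and the toy data are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, learning_rate=0.5, max_iter=1000, tol=1e-6):
    """Batch gradient descent on the log loss; returns (weights, bias, iterations used)."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)   # starting weights
    b = 0.0
    for i in range(1, max_iter + 1):
        p = sigmoid(X @ w + b)               # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples   # gradient of the mean log loss
        grad_b = np.mean(p - y)
        w -= learning_rate * grad_w          # adjust weights against the gradient
        b -= learning_rate * grad_b
        # Stop early once the updates become tiny (convergence).
        if max(np.abs(learning_rate * grad_w).max(), abs(learning_rate * grad_b)) < tol:
            break
    return w, b, i

# Made-up one-feature data with some overlap between the classes.
X_toy = np.array([[-2.0], [-1.5], [-1.0], [-0.5], [0.5], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1, 1, 1])

w, b, n_iter = fit_logistic(X_toy, y_toy)
preds = (sigmoid(X_toy @ w + b) >= 0.5).astype(int)
print("weights:", w, "bias:", b)
print("predictions:", preds, "after", n_iter, "iterations")
```

The max_iter argument here plays the same role as in LogisticRegression: it caps the number of update rounds, and the loop exits sooner if the updates become negligible.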

✅ Q2: Is max_iter=1000 in milliseconds?
❌ No. It is not a time limit; it is the maximum number of training cycles, called iterations.

💡 What max_iter=1000 really means:
"Try up to 1000 rounds of adjusting weights during training."

If the model converges (stabilizes) earlier, it stops.
If not, it continues up to 1000 iterations.

⚠️ Why we increase it:
In scikit-learn, the default max_iter for LogisticRegression is 100, which may not be enough for the solver to fully converge, especially:

With many features

When the features are on very different scales (not standardized)

Or when the model needs more time to find the optimal weights

✅ By setting max_iter=1000, you're making sure the model has enough chance to learn.

❗ If you don’t increase it and it doesn’t converge:
You'll see a warning along these lines:

ConvergenceWarning: lbfgs failed to converge. Increase the number of iterations.
✅ TL;DR:
Concept	Meaning
.fit()	Trains the model by learning from X and y
max_iter=1000	Up to 1000 rounds of training (not time-based)
Why 1000?	Helps prevent convergence failure, especially with many or unscaled features
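
To check how much of the iteration budget a fit actually used, you can inspect the fitted model's n_iter_ attribute. A minimal sketch on synthetic stand-in data (not the Titanic features):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; the notebook itself uses the Titanic features.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

model = LogisticRegression(max_iter=1000)
model.fit(X_demo, y_demo)

# n_iter_ reports how many iterations the solver actually ran.
iterations_used = int(model.n_iter_.max())
print("iterations used:", iterations_used, "of a budget of", model.max_iter)
```

If the number printed equals max_iter, the solver ran out of budget, which is the situation the ConvergenceWarning reports.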



In [20]:
#Predict on Test Data
y_pred_lr = lr.predict(X_test)
print(y_pred_lr)

[0 1 1 0 1 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 1 0 1 1 0 1 0 0 1
 1 0 0 0 1 0 0 0 1 1 0 0 1 1 1 0 0 1 1 1 0 0 0 0 1 1 0 1 0 0 0 1 1 0 1 1 0
 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0
 1 0 1 0 0 0 0 1 0 0 1 0 0 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 1 1 0 1 1 1 0 0 1
 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 0 0 0 1 0 0 1 1 1 1 0 1 0 0 1]


In [21]:
# Evaluate the Model
print("Accuracy:",accuracy_score(y_test,y_pred_lr))
print("Confusion Matrix:\n",confusion_matrix(y_test,y_pred_lr))
print("Classification Report:\n",classification_report(y_test,y_pred_lr))

Accuracy: 0.7808988764044944
Confusion Matrix:
 [[86 23]
 [16 53]]
Classification Report:
               precision    recall  f1-score   support

           0       0.84      0.79      0.82       109
           1       0.70      0.77      0.73        69

    accuracy                           0.78       178
   macro avg       0.77      0.78      0.77       178
weighted avg       0.79      0.78      0.78       178



| Metric                  | What It Tells You                    |
| ----------------------- | ------------------------------------ |
| **Accuracy**            | Overall % of correct predictions     |
| **Confusion Matrix**    | How many TP, TN, FP, FN              |
| **Precision/Recall/F1** | How good the model is for each class |


✅ Q1: Why are we using Logistic Regression here?
Because your target variable Survived is:

Binary → Only two values: 0 (did not survive), 1 (survived)

🔍 Logistic Regression is the go-to model when:
Case	Reason
🎯 Target is binary (0/1, yes/no, true/false)	✅ Logistic regression is designed for this
🔢 You want to predict probabilities	✅ It outputs a probability (e.g., “80% chance survived”)
💬 You need a simple, interpretable model	✅ Easy to explain to non-technical people
🧠 You want a baseline model	✅ Great starting point before trying complex models

✅ Q2: What’s the thumb rule to decide which model to use?
There’s no one-size-fits-all, but here’s a cheat sheet:

Problem Type	Use These Models (start with first)
✅ Binary Classification (0/1)	Logistic Regression, Decision Tree, Random Forest
🎯 Multiclass Classification	Decision Tree, Random Forest, Gradient Boosting, XGBoost
🧮 Regression (predicting numbers)	Linear Regression, Decision Tree Regressor, Random Forest Regressor
🧠 Text/NLP	Naive Bayes, Logistic Regression, Transformers
🖼️ Images	CNNs (via deep learning)
💥 Large data / nonlinear	Random Forest, Gradient Boost, XGBoost, SVM

✅ Rule: Always start simple (LogReg or Tree), then move to complex if needed.

✅ Q3: What are metrics in ML? Why and when to use?
🧠 Definition:
Metrics are scores that tell you how good or bad your model is performing.

📏 Why do we use them?
Because just accuracy is often not enough — we need to know:

What kinds of errors the model is making

Is it biased toward a class?

Is it good at catching rare events (like fraud detection)?

✅ Common Classification Metrics:
Metric	What It Tells You	When to Use
Accuracy	% of total correct predictions	✅ Only when data is balanced
Precision	Of all predicted positives, how many were correct?	✅ When false positives are costly (e.g., spam detection)
Recall	Of all actual positives, how many did we find?	✅ When false negatives are costly (e.g., cancer detection)
F1 Score	Balance between precision & recall	✅ When you want a single score for imbalanced data
Confusion Matrix	Breakdown of TP, TN, FP, FN	✅ For analyzing detailed performance

🧪 Example:
You’re predicting spam email (0 = not spam, 1 = spam)

You care about not marking real emails as spam → Precision is more important

You care about not missing any spam at all → Recall is more important

⚠️ Why Not Just Use Accuracy?
Example	Accuracy Looks Good, But...
Fraud detection (99.9% not fraud)	A model that always predicts “no fraud” = 99.9% accuracy, but totally useless
Rare disease detection	Model misses most sick patients, still gets “good” accuracy

✅ TL;DR Summary:
Concept	Key Point
Logistic Regression	Strong, interpretable baseline for a binary target
Model choice	Start simple, depends on task
Metrics	Evaluate how and where your model is right or wrong

In [None]:
#Now compare this model with other models
#We now test if Decision Tree or Random Forest can beat Logistic Regression.

| Model             | What It Brings                                         |
| ----------------- | ------------------------------------------------------ |
| **Decision Tree** | Handles non-linear patterns better than a linear model |
| **Random Forest** | Often more accurate; overfits less than a single tree  |


In [22]:
#Train Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train,y_train)
y_pred_dt = dt.predict(X_test)

print("DT Accuracy:",accuracy_score(y_test,y_pred_dt))
print("DT ConfusionMatrix:\n",confusion_matrix(y_test,y_pred_dt))
print("DT ClassificationReport:\n",classification_report(y_test,y_pred_dt))


DT Accuracy: 0.7303370786516854
DT ConfusionMatrix:
 [[79 30]
 [18 51]]
DT ClassificationReport:
               precision    recall  f1-score   support

           0       0.81      0.72      0.77       109
           1       0.63      0.74      0.68        69

    accuracy                           0.73       178
   macro avg       0.72      0.73      0.72       178
weighted avg       0.74      0.73      0.73       178



In [24]:
#Train Random Forest
from sklearn.ensemble import RandomForestClassifier

rf= RandomForestClassifier()
rf.fit(X_train,y_train)

y_pred_rf = rf.predict(X_test)
print("RF Accuracy:",accuracy_score(y_test,y_pred_rf))
print("RF ConfusionMatrix:\n",confusion_matrix(y_test,y_pred_rf))
print("RF ClassificationReport:\n",classification_report(y_test,y_pred_rf))


RF Accuracy: 0.797752808988764
RF ConfusionMatrix:
 [[92 17]
 [19 50]]
RF ClassificationReport:
               precision    recall  f1-score   support

           0       0.83      0.84      0.84       109
           1       0.75      0.72      0.74        69

    accuracy                           0.80       178
   macro avg       0.79      0.78      0.79       178
weighted avg       0.80      0.80      0.80       178
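
The three accuracies above come from a single 178-row test split, and the tree and forest were created without a random_state, so rerunning the cells can shift the numbers. One hedged way to get a steadier comparison is cross_val_score from sklearn.model_selection. The sketch below runs on synthetic data, so its scores say nothing about the Titanic results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in X and y from the notebook to compare the real models.
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

mean_scores = {}
for name, model in models.items():
    # 5-fold cross-validation: train on 4 folds, score on the 5th, rotate.
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    mean_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```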



Step 1: Confusion Matrix — What Is It?
Let’s say you're predicting if a passenger survived (1) or did not survive (0):
Here’s a possible confusion matrix output:

               Predicted
             |  0  |  1  
         -----------------
   Actual 0 | TN  | FP  
   Actual 1 | FN  | TP  
Term	Meaning
TP (True Positive)	Model predicted 1 (survived), and it's correct ✅
TN (True Negative)	Model predicted 0 (did not survive), and it's correct ✅
FP (False Positive)	Model predicted 1, but it was actually 0 ❌
FN (False Negative)	Model predicted 0, but it was actually 1 ❌

💡 Example:
You predicted survival:

Passenger	True	Predicted	Type
1	1	1	✅ TP
2	0	1	❌ FP
3	0	0	✅ TN
4	1	0	❌ FN
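
The four passengers in this table can be fed straight into confusion_matrix. In this sketch, y_hat is just a placeholder name for the predictions:

```python
from sklearn.metrics import confusion_matrix

# True and predicted labels for passengers 1 to 4 from the table.
y_true = [1, 0, 0, 1]
y_hat = [1, 1, 0, 0]

cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()   # row-major order: [[TN, FP], [FN, TP]]

print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

Each of the four cells ends up with a count of 1, one passenger per outcome type.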

📊 Step 2: Classification Report Breakdown
This gives you:

print(classification_report(y_test, y_pred))
Example output (illustrative numbers, not from the models trained above):

Metric	Class 0	Class 1
Precision	0.75	0.68
Recall	0.79	0.63
F1-score	0.77	0.65
Support	110	79

Let’s decode them:

✅ Precision:
Of all passengers the model predicted as survived (1), how many actually survived?

Precision = TP / (TP + FP)
✅ High precision = few false alarms
👉 Useful when false positives are expensive (e.g., spam filter)

✅ Recall:
Of all passengers who actually survived, how many did the model correctly catch?

Recall = TP / (TP + FN)
✅ High recall = few missed actual cases
👉 Useful when false negatives are dangerous (e.g., cancer detection)

✅ F1 Score:
Balance between precision & recall (good when data is imbalanced)

F1 = 2 * (Precision * Recall) / (Precision + Recall)
✅ Think of it as the overall skill of the model at that class.

✅ Support:
Number of actual occurrences in your test set
(so if 79 people actually survived, support for class 1 = 79)

🧠 TL;DR Summary
Term	Think of It As...	Formula
Precision	"How accurate are my positive predictions?"	TP / (TP + FP)
Recall	"How well did I catch all real positives?"	TP / (TP + FN)
F1 Score	"Balanced score of precision & recall"	harmonic mean
Accuracy	"Overall correctness"	(TP + TN) / Total
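
As a worked check, the formulas can be applied by hand to the logistic regression confusion matrix printed earlier ([[86 23] [16 53]]); the results line up with the class 1 row of its classification report.

```python
# Counts taken from the logistic regression confusion matrix: [[86, 23], [16, 53]].
tn, fp, fn, tp = 86, 23, 16, 53

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# precision=0.70 recall=0.77 f1=0.73 accuracy=0.78
```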

✅ Easy Analogy:
Imagine a COVID test:

Term	Meaning
TP	Sick person → test says "positive" ✅
FP	Healthy person → test says "positive" ❌
FN	Sick person → test says "negative" ❌ (dangerous)
TN	Healthy person → test says "negative" ✅



Precision vs recall


Both precision and recall focus on the positive class (usually 1 = something happened, like survived, or fraud, or spam).

🧠 Think of Precision and Recall Like This:
Metric	Focuses on...	Real-Life Analogy
Precision ✅	"Of what I predicted as positive, how many were actually correct?"	📥 Email: Of all emails marked spam, how many actually were spam?
Recall 🔍	"Of all actual positives, how many did I catch?"	🧪 COVID Test: Of all people who had COVID, how many did the test correctly catch?

⚖️ Precision vs Recall = Quality vs Coverage
Situation	What happens	What it means
✅ High Precision	Fewer false positives (fewer wrong positives)	You can be confident in positive predictions
✅ High Recall	Fewer false negatives (fewer misses)	You catch more real cases
❌ Low Precision	Model guesses “positive” too often, many wrong	Many false alarms
❌ Low Recall	Model misses real positives (dangerous)	Many real cases slip through
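
One common way the trade-off shows up is through the probability cutoff used to call something "positive". The sketch below is an assumption-laden illustration: the labels, the scores and the helper precision_recall are all made up.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall for the positive class at a given cutoff."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)

# Made-up labels and predicted probabilities for ten cases.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]
scores = [0.10, 0.20, 0.35, 0.40, 0.45, 0.60, 0.65, 0.80, 0.90, 0.95]

low_cutoff = precision_recall(y_true, scores, threshold=0.3)
high_cutoff = precision_recall(y_true, scores, threshold=0.7)
print(low_cutoff)    # (0.625, 1.0)
print(high_cutoff)   # (1.0, 0.6)
```

With the low cutoff the model says "positive" often, so it catches every real case but raises false alarms; with the high cutoff every positive call is right, but two real cases are missed.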

🔥 Easy Trick to Remember:
Precision = How precise is the model when it says "Yes"?
Recall = Did the model remember to catch all the real Yes cases?



🧪 Example:
Imagine you work the door at a party that only celebrities should enter.
There are 10 real celebrities in the queue
You let in 6 people
But only 3 of them were real celebrities

Precision = 3 out of 6 → 3/6 = 0.5

Recall = 3 out of 10 → 3/10 = 0.3
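
The same arithmetic as a short set-based sketch, with hypothetical guest names (only the counts matter):

```python
# Hypothetical guest names; only the counts matter.
celebrities = {f"celebrity_{i}" for i in range(10)}   # the 10 real celebrities
admitted = {"celebrity_0", "celebrity_1", "celebrity_2", "fan_a", "fan_b", "fan_c"}   # the 6 let in

true_positives = len(admitted & celebrities)   # real celebrities who got in

precision = true_positives / len(admitted)     # of those let in, how many were real
recall = true_positives / len(celebrities)     # of the real ones, how many got in
print(precision, recall)   # 0.5 0.3
```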