
# **LOGISTIC REGRESSION**

### **1. Data Exploration:**

**a) Load the dataset and perform exploratory data analysis (EDA)**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
train_df = pd.read_csv("Titanic_train.csv")
test_df = pd.read_csv("Titanic_test.csv")


**b) Examine the features, their types, and summary statistics.**

In [None]:
# Display dataset information (feature names, data types, and non-null counts)
# info() prints its report directly and returns None, so it is not wrapped in print()
print("Dataset Information:")
train_df.info()

# Display summary statistics for numerical features
print("\nSummary Statistics for Numerical Features:")
print(train_df.describe())

# Display summary statistics for categorical features
print("\nSummary Statistics for Categorical Features:")
print(train_df.describe(include=['O']))  # 'O' refers to object (categorical) data type

# Display the first few rows
print("\nFirst 5 Rows of Training Data:")
print(train_df.head())

**c) Create visualizations such as histograms, box plots, or pair plots to visualize the distributions and
relationships between features.**

In [None]:
# Visualizations

# Histogram of numerical features
train_df.hist(figsize=(10, 8), bins=20, edgecolor='black')
plt.tight_layout()
plt.show()

# Box plot of numerical features
plt.figure(figsize=(12, 6))
sns.boxplot(data=train_df[["Age", "Fare", "SibSp", "Parch"]])
plt.title("Box Plots of Numerical Features")
plt.show()

# Correlation heatmap
plt.figure(figsize=(10, 6))

# Select only numerical features for correlation calculation
numerical_df = train_df.select_dtypes(include=['number'])
sns.heatmap(numerical_df.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Correlation Heatmap")
plt.show()

# Pair plot ('Survived' stays an integer column so later cells can average it)
sns.pairplot(train_df, hue="Survived", vars=["Age", "Fare", "SibSp", "Parch"])
plt.show()

**d) Analyze any patterns or correlations observed in the data.**

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Convert the grouping variables to categorical for visualization.
# 'Survived' must stay numeric (0/1) so each bar shows the mean survival rate.
train_df["Survived"] = train_df["Survived"].astype(int)
train_df["Sex"] = train_df["Sex"].astype("category")
train_df["Embarked"] = train_df["Embarked"].astype("category")

# 1. Survival rate by Passenger Class (Pclass)
plt.figure(figsize=(8, 5))
sns.barplot(x="Pclass", y="Survived", data=train_df, ci=None, palette="coolwarm")
plt.title("Survival Rate by Passenger Class")
plt.ylabel("Survival Probability")
plt.xlabel("Passenger Class")
plt.show()

# 2. Survival rate by Gender (Sex)
plt.figure(figsize=(6, 5))
sns.barplot(x="Sex", y="Survived", data=train_df, ci=None, palette="coolwarm")
plt.title("Survival Rate by Gender")
plt.ylabel("Survival Probability")
plt.xlabel("Gender")
plt.show()

# 3. Survival rate by Embarkation Port (Embarked)
plt.figure(figsize=(6, 5))
sns.barplot(x="Embarked", y="Survived", data=train_df, ci=None, palette="coolwarm")
plt.title("Survival Rate by Embarkation Port")
plt.ylabel("Survival Probability")
plt.xlabel("Port of Embarkation")
plt.show()

# 4. Age distribution of survivors vs non-survivors
plt.figure(figsize=(10, 6))
# 'fill' replaces the deprecated 'shade' argument (seaborn >= 0.11)
sns.kdeplot(train_df.loc[train_df["Survived"] == 1, "Age"], label="Survived", fill=True, color="green")
sns.kdeplot(train_df.loc[train_df["Survived"] == 0, "Age"], label="Did Not Survive", fill=True, color="red")
plt.title("Age Distribution: Survivors vs Non-Survivors")
plt.xlabel("Age")
plt.ylabel("Density")
plt.legend()
plt.show()

# 5. Fare distribution for Survivors vs Non-Survivors
plt.figure(figsize=(10, 6))
sns.boxplot(x="Survived", y="Fare", data=train_df, palette="coolwarm")
plt.title("Fare Distribution by Survival Status")
plt.xlabel("Survived")
plt.ylabel("Fare")
plt.show()

### **2. Data Preprocessing:**


**a)  Handle missing values (e.g., imputation).**

In [None]:
import pandas as pd

# Load the dataset
train_df = pd.read_csv("Titanic_train.csv")
test_df = pd.read_csv("Titanic_test.csv")

# =======================
# 1. Checking Missing Values
# =======================
print("Missing values before handling:\n", train_df.isnull().sum())

# =======================
# 2. Handling Missing Values
# =======================

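# Imputation values are computed on the training set and reused for the test set,
# so both sets are preprocessed consistently. Assignment is used instead of
# chained fillna(..., inplace=True), which recent pandas versions warn about.

# Fill missing 'Age' values with the median age
age_median = train_df["Age"].median()
train_df["Age"] = train_df["Age"].fillna(age_median)
test_df["Age"] = test_df["Age"].fillna(age_median)

# Fill missing 'Fare' values with the median fare
fare_median = train_df["Fare"].median()
train_df["Fare"] = train_df["Fare"].fillna(fare_median)
test_df["Fare"] = test_df["Fare"].fillna(fare_median)

# Fill missing 'Embarked' values with the mode (most common value)
embarked_mode = train_df["Embarked"].mode()[0]
train_df["Embarked"] = train_df["Embarked"].fillna(embarked_mode)
test_df["Embarked"] = test_df["Embarked"].fillna(embarked_mode)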

# Drop 'Cabin' column (too many missing values)
train_df.drop(columns=["Cabin"], inplace=True)
test_df.drop(columns=["Cabin"], inplace=True)

# =======================
# 3. Checking Missing Values After Handling
# =======================
print("\nMissing values after handling:\n", train_df.isnull().sum())

**b) Encode categorical variables.**

In [None]:
# 1. Encoding Categorical Variables

# Convert 'Sex' column into numerical (0 = male, 1 = female)
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1})
test_df["Sex"] = test_df["Sex"].map({"male": 0, "female": 1})


# 2. Display Processed Data

print("Missing values after preprocessing:\n", train_df.isnull().sum())
print("\nFirst 5 rows of the processed training dataset:")
print(train_df.head())

### **3. Model Building:**

**a). Build a logistic regression model using appropriate libraries (e.g., scikit-learn).**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the dataset
train_df = pd.read_csv("Titanic_train.csv")
test_df = pd.read_csv("Titanic_test.csv")

# Handle missing values (statistics come from the training set and are reused for the test set)
age_median = train_df["Age"].median()
train_df["Age"] = train_df["Age"].fillna(age_median)
test_df["Age"] = test_df["Age"].fillna(age_median)

fare_median = train_df["Fare"].median()
train_df["Fare"] = train_df["Fare"].fillna(fare_median)
test_df["Fare"] = test_df["Fare"].fillna(fare_median)

embarked_mode = train_df["Embarked"].mode()[0]
train_df["Embarked"] = train_df["Embarked"].fillna(embarked_mode)
test_df["Embarked"] = test_df["Embarked"].fillna(embarked_mode)

# Drop 'Cabin' column (too many missing values)
train_df.drop(columns=["Cabin"], inplace=True)
test_df.drop(columns=["Cabin"], inplace=True)

# Encode categorical variables
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1})
test_df["Sex"] = test_df["Sex"].map({"male": 0, "female": 1})

train_df["Embarked"] = train_df["Embarked"].map({"C": 0, "Q": 1, "S": 2})
test_df["Embarked"] = test_df["Embarked"].map({"C": 0, "Q": 1, "S": 2})

# ==========================
# Select Features and Target
# ==========================
# Select relevant features
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "Embarked"]
X = train_df[features]  # Independent variables
y = train_df["Survived"]  # Target variable

# Split data into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features using statistics from the training split only
# (puts features on a common scale so their coefficients are comparable)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# ==========================
# Train Logistic Regression Model
# ==========================
# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)

# ==========================
# Model Evaluation
# ==========================
# Predict on validation data
y_pred = model.predict(X_val)

# Calculate accuracy
accuracy = accuracy_score(y_val, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")

# Display classification report
print("\nClassification Report:\n", classification_report(y_val, y_pred))

# Display confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_pred))


**b). Train the model using the training data.**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Load the dataset
train_df = pd.read_csv("Titanic_train.csv")
test_df = pd.read_csv("Titanic_test.csv")


# Handle missing values (assignment avoids the chained inplace fillna that pandas warns about)
train_df["Age"] = train_df["Age"].fillna(train_df["Age"].median())
train_df["Fare"] = train_df["Fare"].fillna(train_df["Fare"].median())
train_df["Embarked"] = train_df["Embarked"].fillna(train_df["Embarked"].mode()[0])

# Encode categorical variables
train_df["Sex"] = train_df["Sex"].map({"male": 0, "female": 1})
train_df["Embarked"] = train_df["Embarked"].map({"C": 0, "Q": 1, "S": 2})

# ==========================
# Select Features and Target
# ==========================
features = ["Pclass", "Sex", "Age", "Fare", "SibSp", "Parch", "Embarked"]
X = train_df[features]  # Independent variables
y = train_df["Survived"]  # Target variable

# Split data into training and validation sets (80% train, 20% validation)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize features (scaling helps improve model performance)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_val = scaler.transform(X_val)

# ==========================
# Train the Logistic Regression Model
# ==========================
# Initialize the model
model = LogisticRegression()

# Train (fit) the model on training data
model.fit(X_train, y_train)

# ==========================
# Model Training Complete
# ==========================
print("Logistic Regression model trained successfully!")

### **4. Model Evaluation:**

**a). Evaluate the performance of the model on the testing data using accuracy, precision, recall, F1-score, and ROC-AUC score.**

**Visualize the ROC curve.**

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, roc_curve, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Reuse the trained model and the validation split (X_val, y_val) from the Model Building cells
y_pred = model.predict(X_val)  # Predictions on validation data
y_pred_proba = model.predict_proba(X_val)[:, 1]  # Probability of survival

# Calculate evaluation metrics
accuracy = accuracy_score(y_val, y_pred)
precision = precision_score(y_val, y_pred)
recall = recall_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred)
roc_auc = roc_auc_score(y_val, y_pred_proba)

# Print evaluation metrics
print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall: {recall:.4f}")
print(f"F1-score: {f1:.4f}")
print(f"ROC-AUC Score: {roc_auc:.4f}")

# Print classification report
print("\nClassification Report:\n", classification_report(y_val, y_pred))

# Print confusion matrix
print("\nConfusion Matrix:\n", confusion_matrix(y_val, y_pred))

# ==========================
# Plot the ROC Curve
# ==========================
fpr, tpr, _ = roc_curve(y_val, y_pred_proba)  # Use y_val here
plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color="blue", label=f"ROC Curve (AUC = {roc_auc:.2f})")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # Diagonal line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve for Titanic Survival Prediction")
plt.legend()
plt.show()

### **5. Interpretation:**

**a). Interpret the coefficients of the logistic regression model.**

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

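# Interpret Model Coefficients
# Reuses the fitted `model` and the `features` list from the Model Building cells.
# The features were standardized, so each coefficient is the change in the log-odds
# of survival per one standard deviation of that feature.
coefficients = model.coef_[0]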

# Create a DataFrame to display feature names with their respective coefficients
coef_df = pd.DataFrame({"Feature": features, "Coefficient": coefficients})

# Sort by absolute coefficient values
coef_df["Abs_Coefficient"] = coef_df["Coefficient"].abs()
coef_df = coef_df.sort_values(by="Abs_Coefficient", ascending=False).drop(columns=["Abs_Coefficient"])

# Display the coefficients
print("\nLogistic Regression Coefficients:\n")
print(coef_df)
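
A coefficient is easier to read once it is converted to an odds ratio, since `exp(coefficient)` is the factor by which the odds of survival are multiplied for a one-unit increase in the feature (one standard deviation here, because the features were standardized). The sketch below uses made-up coefficient values purely for illustration; they are not output from this notebook, and `odds_ratio` is a hypothetical helper name.

```python
import math

# Hypothetical coefficients for illustration only (NOT results from this notebook).
example_coefficients = {"Sex": 1.0, "Pclass": -0.5, "Fare": 0.0}


def odds_ratio(coefficient):
    """Return the multiplicative change in the odds for a one-unit feature increase."""
    return math.exp(coefficient)


for name, coef in example_coefficients.items():
    ratio = odds_ratio(coef)
    if ratio > 1:
        direction = "raises"
    elif ratio < 1:
        direction = "lowers"
    else:
        direction = "does not change"
    print(f"{name}: coefficient={coef:+.2f}, odds ratio={ratio:.3f} ({direction} the odds)")
```

A ratio above 1 means the feature pushes predictions toward survival, a ratio below 1 pushes away from it, and a ratio of exactly 1 means no effect.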

**b). Discuss the significance of features in predicting the target variable (survival probability in this case)**

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Load the Titanic dataset
train_df = pd.read_csv("Titanic_train.csv")
test_df = pd.read_csv("Titanic_test.csv")

# Data Preprocessing
def preprocess_data(df):
    df = df.copy()  # work on a copy so the caller's DataFrame is not modified

    # Encode categorical variables
    df['Sex'] = LabelEncoder().fit_transform(df['Sex'])  # female=0, male=1 (alphabetical order)
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    df['Embarked'] = LabelEncoder().fit_transform(df['Embarked'])

    # Fill missing values in Age and Fare using the median
    # (Fare is imputed too, as in the earlier cells, so prediction never sees NaN)
    imputer = SimpleImputer(strategy="median")
    df[['Age', 'Fare']] = imputer.fit_transform(df[['Age', 'Fare']])

    return df

train_df = preprocess_data(train_df)
test_df = preprocess_data(test_df)

# Select features and target variable
features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked']
X = train_df[features]
y = train_df['Survived']

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a RandomForestClassifier model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("\nConfusion Matrix:\n", confusion_matrix(y_test, y_pred))

# Cross-validation score
cv_scores = cross_val_score(model, X, y, cv=5)
print("\nCross-validation Accuracy:", np.mean(cv_scores))

# Feature Importance Visualization
feature_importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)

plt.figure(figsize=(8, 5))
sns.barplot(x=feature_importance, y=feature_importance.index, palette="viridis")
plt.xlabel("Feature Importance Score")
plt.ylabel("Features")
plt.title("Feature Importance for Survival Prediction")
plt.show()

# Exploratory Data Analysis (EDA) - Survival Rate by Gender
plt.figure(figsize=(6, 4))
sns.barplot(x=train_df["Sex"], y=train_df["Survived"], palette="coolwarm")
plt.xticks([0, 1], ["Female", "Male"])
plt.ylabel("Survival Rate")
plt.title("Survival Rate by Gender")
plt.show()

# Predict on test dataset
test_predictions = model.predict(test_df[features])

# Save predictions
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': test_predictions})
submission.to_csv("Titanic_Survival_Predictions.csv", index=False)
print("Predictions saved to Titanic_Survival_Predictions.csv")

### **INTERVIEW QUESTIONS:**

**1. What is the difference between precision and recall?**

Precision and recall are key metrics for evaluating classification models, especially on imbalanced datasets.

* **Precision (Positive Predictive Value):**

  Precision measures how many of the predicted positive cases are actually positive: $\text{Precision} = \frac{TP}{TP + FP}$. It focuses on the correctness of positive predictions.

  **Example:** If a spam filter flags 100 emails as spam but only 80 of them are actual spam (and 20 are not), the precision is $\frac{80}{100} = 0.8$ (80%).

* **Recall (Sensitivity or True Positive Rate):**

  Recall measures how many actual positive cases were correctly identified by the model: $\text{Recall} = \frac{TP}{TP + FN}$. It focuses on minimizing false negatives.

  **Example:** If there are 120 spam emails in total, and the model correctly identifies 80 of them as spam, the recall is $\frac{80}{120} \approx 0.67$ (67%).
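
The two spam-filter examples can be reproduced from raw counts. In this minimal sketch the counts come from the examples above (80 true positives, 20 false positives, 40 missed spam emails); the function names `precision`, `recall`, and `f1` are illustrative helpers, not a library API.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of actual positives that the model found."""
    return tp / (tp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


tp = 80  # spam emails correctly flagged
fp = 20  # legitimate emails wrongly flagged (100 flagged - 80 correct)
fn = 40  # spam emails missed (120 actual spam - 80 found)

p = precision(tp, fp)
r = recall(tp, fn)
print(f"Precision: {p:.2f}")    # Precision: 0.80
print(f"Recall:    {r:.2f}")    # Recall:    0.67
print(f"F1-score:  {f1(p, r):.2f}")  # F1-score:  0.73
```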

* **Key Difference:**

•	Precision is about correctness of positive predictions.

•	Recall is about completeness in finding positive cases.

* **Trade-off Between Precision and Recall:**

•	Increasing precision often decreases recall (fewer false positives but more false negatives).

•	Increasing recall often decreases precision (fewer false negatives but more false positives).
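
The trade-off shows up when the decision threshold on predicted probabilities is moved. The sketch below uses a small set of hypothetical labels and scores (invented for illustration, not taken from the Titanic model) and a hypothetical helper `precision_recall_at`.

```python
# Hypothetical true labels and predicted probabilities, for illustration only.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10]


def precision_recall_at(threshold, labels, scores):
    """Classify as positive when score >= threshold, then compute precision and recall."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predicted) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predicted) if y == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.8, 0.5, 0.25):
    prec, rec = precision_recall_at(threshold, labels, scores)
    print(f"threshold={threshold:.2f}  precision={prec:.2f}  recall={rec:.2f}")
```

On this toy data, lowering the threshold from 0.8 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to 0.67.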

**2. What is cross-validation, and why is it important in binary classification?**

Cross-validation is a technique used in machine learning to evaluate the performance of a model by splitting the dataset into multiple subsets. It helps ensure that the model generalizes well to unseen data rather than just memorizing patterns in the training data.

* **How Cross-Validation Works:**

1.	The dataset is divided into k subsets (or “folds”).

2.	The model is trained on k-1 folds and tested on the remaining fold.

3.	This process repeats k times, each time using a different fold as the test set.

4.	The final performance metric (e.g., accuracy, precision, recall) is averaged across all k iterations.


**Example (5-Fold Cross-Validation):**

1.	Split data into 5 folds.

2.	Train on 4 folds, test on 1 fold.

3.	Repeat this 5 times, each time selecting a different fold for testing.

4.	Average the results.
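
The four steps above can be sketched in plain Python. This is a simplified illustration, not how scikit-learn implements it: `k_fold_indices` and `cross_validate_majority` are hypothetical helpers, the labels are made up, and the "model" is just a majority-class baseline so the example stays self-contained.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


def cross_validate_majority(labels, k):
    """Average accuracy of a majority-class baseline over k folds."""
    fold_scores = []
    for train, test in k_fold_indices(len(labels), k):
        train_labels = [labels[i] for i in train]
        # "Training": remember the most frequent class in the training folds.
        majority = max(set(train_labels), key=train_labels.count)
        # "Testing": accuracy of always predicting that class on the held-out fold.
        correct = sum(1 for i in test if labels[i] == majority)
        fold_scores.append(correct / len(test))
    return sum(fold_scores) / len(fold_scores), fold_scores


# Hypothetical labels for illustration only.
labels = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
mean_score, fold_scores = cross_validate_majority(labels, k=5)
print("Per-fold accuracy:", fold_scores)
print("Mean accuracy:", mean_score)
```

In practice, `sklearn.model_selection.KFold` and `cross_val_score` provide the same procedure for real estimators, with optional shuffling.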

* **Why is Cross-Validation Important in Binary Classification?**

Binary classification involves predicting one of two classes (e.g., spam vs. not spam, disease vs. no disease). Cross-validation is especially useful in this case because:

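**1.	Helps Detect Overfitting:**

•	A model that merely memorizes the training data scores well on its training folds but poorly on the held-out fold, so cross-validation exposes poor generalization. It measures overfitting rather than preventing it.

**2.	Provides More Reliable Metrics:**

•	Instead of relying on a single train-test split, cross-validation averages performance over multiple splits, which reduces the influence of one lucky or unlucky split.

**3.	Handles Imbalanced Datasets Better:**

•	In binary classification with class imbalance, a single train-test split may leave very few minority-class examples in the test set. Cross-validation tests on every sample exactly once, and stratified folds keep the class ratio the same in each fold.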

**4.	Optimizes Hyperparameters:**

•	When tuning hyperparameters (e.g., learning rate, depth of a decision tree), cross-validation provides a better estimate of the best settings.

**5.	Makes Efficient Use of Data:**

•	Instead of wasting data by setting aside a large test set, cross-validation ensures every data point is used for both training and testing at some point.


**Types of Cross-Validation**

1.	k-Fold Cross-Validation (Most Common)
	•	Data is split into k parts, and training/testing happens k times.

2.	Stratified k-Fold Cross-Validation
	•	Ensures each fold has the same proportion of classes as the original dataset (important for imbalanced datasets).

3.	Leave-One-Out Cross-Validation (LOOCV)
	•	Each data point is used as a test set once, and all other points are used for training.
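
To make the stratified variant concrete, here is a minimal sketch that deals sample indices into folds class by class, so every fold keeps roughly the original class ratio. `stratified_folds` is a hypothetical helper and the labels are invented; scikit-learn's `StratifiedKFold` is the library tool for real use.

```python
from collections import defaultdict


def stratified_folds(labels, k):
    """Assign sample indices to k folds, round-robin within each class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds


# Hypothetical imbalanced labels: 8 negatives and 4 positives (2:1 ratio).
labels = [0] * 8 + [1] * 4
folds = stratified_folds(labels, k=4)

for number, fold in enumerate(folds, start=1):
    positives = sum(labels[i] for i in fold)
    print(f"Fold {number}: size={len(fold)}, positives={positives}")
```

Each of the four folds ends up with two negatives and one positive, matching the 2:1 ratio of the full label list.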

* **Conclusion:**

Cross-validation gives a more reliable estimate of how well a binary classification model generalizes. It evaluates the model on every part of the data, helps detect overfitting, and supports hyperparameter tuning.