In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

## Data Collection and Processing

In [None]:
Disaster_data=pd.read_csv('/content/Titanic - Machine Learning from Disaster.csv')

In [None]:
Disaster_data.head()

In [None]:
Disaster_data.info()

In [None]:
Disaster_data.isnull().sum()

In [None]:
Disaster_data.drop(columns=['Cabin','Age','Embarked','Ticket'],inplace=True)

In [None]:
Disaster_data.head()

# Separating the features and target

In [None]:
# Drop the target and the free-text 'Name' column, then one-hot encode 'Sex'
X = Disaster_data.drop(columns=['Survived', 'Name'])
Y = Disaster_data['Survived']
X = pd.get_dummies(X, columns=['Sex'], drop_first=True)

In [None]:
print(X)
print(Y)

# Data Standardization

In [None]:
scaler=StandardScaler()

In [None]:
scaler.fit(X)

In [None]:
scaler.fit_transform(X)

In [None]:
Y.head()

# Splitting the Data into Training and Test Sets

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X,Y,test_size=0.2,random_state=2)

In [None]:
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

## Training the Model

In [None]:
model=SVC()
model.fit(x_train, y_train)

## Accuracy on Training Data

In [None]:
x_train_pred=model.predict(x_train)

In [None]:
# accuracy_score expects (y_true, y_pred)
x_train_Accuracy = accuracy_score(y_train, x_train_pred)
print(x_train_Accuracy)

# Task
Apply StandardScaler and split the *scaled* data into training and testing sets, then calculate and print the accuracy of the currently trained SVC model on the test dataset.

## Apply StandardScaler and Split Data

### Subtask:
Apply the initialized StandardScaler to your features (X) and then split the *scaled* data into training and testing sets. This ensures your model is trained and evaluated on properly scaled data, which is crucial for SVM performance.


**Reasoning**:
The subtask requires applying the initialized StandardScaler to the feature DataFrame `X` and then splitting the *scaled* data along with the target variable `Y` into training and testing sets. This step will first transform `X` to `X_scaled` and then perform the train-test split.



In [None]:
X_scaled = scaler.transform(X)
x_train_scaled, x_test_scaled, y_train, y_test = train_test_split(X_scaled, Y, test_size=0.2, random_state=2)
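
Note that the scaler above was fitted on the full feature matrix before splitting, so the test rows influence the scaling statistics. A common alternative is to split first and fit the scaler on the training rows only. The following is a minimal sketch of that ordering on synthetic data; the array `X_demo`, the labels `y_demo`, and the other variable names are illustrative assumptions, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (assumption: 100 rows, 3 numeric features)
rng = np.random.default_rng(2)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Split first, so the test rows play no part in fitting the scaler
x_tr, x_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=2)

demo_scaler = StandardScaler()
x_tr_scaled = demo_scaler.fit_transform(x_tr)  # learn mean/std from training rows only
x_te_scaled = demo_scaler.transform(x_te)      # reuse those statistics on the test rows

print(x_tr_scaled.mean(axis=0).round(6))  # approximately zero for each column
print(x_te_scaled.mean(axis=0).round(6))  # close to, but generally not exactly, zero
```

Whether this changes the reported accuracies for the Titanic data is not something this sketch establishes; it only shows the ordering.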

**Reasoning**:
To complete the subtask of calculating the accuracy on the test dataset using the currently trained SVC model, predictions need to be made on the `x_test_scaled` data, and then the accuracy score needs to be calculated by comparing these predictions to `y_test`.



In [None]:
y_pred_scaled = model.predict(x_test_scaled)
test_accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy of the currently trained model on the scaled test data: {test_accuracy_scaled}")

**Reasoning**:
The previous cell warned that the SVC model was fitted on data with feature names (the unscaled DataFrame) but was asked to predict on data without them (the scaled NumPy array). More importantly, the model learned from unscaled values, so its predictions on scaled inputs are unreliable. The model therefore needs to be retrained on `x_train_scaled`. This step re-initializes the SVC model, trains it on the scaled training data, and calculates its accuracy on the scaled test data.



In [None]:
model=SVC()
model.fit(x_train_scaled, y_train)

y_pred_scaled = model.predict(x_test_scaled)
test_accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
print(f"Accuracy of the retrained model on the scaled test data: {test_accuracy_scaled}")
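
One way to avoid the feature-name mismatch altogether is to keep the scaled features in a DataFrame, so that training and prediction inputs carry the same column names. A minimal sketch, using a small made-up frame (`X_demo` and its values are assumptions for illustration only):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny illustrative frame (assumption: column names mimic the Titanic features)
X_demo = pd.DataFrame({
    "Pclass": [3, 1, 3, 2],
    "Fare": [7.25, 71.28, 7.92, 13.00],
    "Sex_male": [1, 0, 0, 1],
})

demo_scaler = StandardScaler()
# StandardScaler returns a NumPy array; wrap it to restore column names and index
X_demo_scaled = pd.DataFrame(
    demo_scaler.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

print(X_demo_scaled.columns.tolist())
```

Passing such a frame to `train_test_split` keeps the result as DataFrames, so a model fitted on the training part sees the same column names at prediction time.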

## Hyperparameter Tuning for SVC

### Subtask:
Improve the SVC model's performance by systematically searching for the best hyperparameters (e.g., 'C', 'kernel', 'gamma') using techniques like GridSearchCV or RandomizedSearchCV. This can significantly impact accuracy.


**Reasoning**:
To improve the SVC model's performance, I will perform hyperparameter tuning using `GridSearchCV`. This involves importing `GridSearchCV`, defining a parameter grid, initializing an SVC model, instantiating `GridSearchCV`, fitting it to the scaled training data, and then printing the best parameters and best cross-validation score.



In [None]:
from sklearn.model_selection import GridSearchCV

# Define the parameter grid
param_grid = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}

# Initialize an SVC model
svc_model = SVC(random_state=2)

# Instantiate GridSearchCV
grid_search = GridSearchCV(estimator=svc_model, param_grid=param_grid, cv=5, scoring='accuracy', n_jobs=-1)

# Fit GridSearchCV to the scaled training data
grid_search.fit(x_train_scaled, y_train)

# Print the best parameters and best cross-validation score
print(f"Best parameters found: {grid_search.best_params_}")
print(f"Best cross-validation accuracy: {grid_search.best_score_}")
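
In scikit-learn's `SVC`, `gamma` is only used by the `rbf`, `poly`, and `sigmoid` kernels, so the grid above fits each linear-kernel candidate twice with identical results. A list of dictionaries can express the search without that redundancy. The sketch below only counts candidates with `ParameterGrid`; the names `full_grid` and `refined_grid` are illustrative assumptions.

```python
from sklearn.model_selection import ParameterGrid

# Grid as written in the notebook: every combination of C, kernel, and gamma
full_grid = {
    "C": [0.1, 1, 10, 100],
    "kernel": ["linear", "rbf"],
    "gamma": ["scale", "auto"],
}

# Alternative: only vary gamma for the kernel that uses it
refined_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": ["scale", "auto"]},
]

print(len(ParameterGrid(full_grid)))     # 16 candidates
print(len(ParameterGrid(refined_grid)))  # 12 candidates
```

A list like `refined_grid` can be passed to `GridSearchCV` as `param_grid` in place of the single dictionary.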

**Reasoning**:
Now that the best hyperparameters have been identified, the next step is to evaluate the model's performance on the scaled test data using these optimal parameters. This involves retrieving the best estimator from the `GridSearchCV` object, making predictions on the `x_test_scaled` data, and then calculating and printing the accuracy score.



In [None]:
best_svc_model = grid_search.best_estimator_
y_pred_best = best_svc_model.predict(x_test_scaled)
test_accuracy_best = accuracy_score(y_test, y_pred_best)
print(f"Accuracy of the SVC model with best hyperparameters on the scaled test data: {test_accuracy_best}")

In [None]:
import pickle

# Save the best SVC model to a file
filename = 'best_svc_model.pkl'
with open(filename, 'wb') as f:
    pickle.dump(best_svc_model, f)

# Reload it and check that it predicts on one scaled test row
with open(filename, 'rb') as f:
    pickled_model = pickle.load(f)

print(pickled_model.predict(x_test_scaled[0].reshape(1, -1)))
print(f"Best SVC model saved to {filename}")
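
The pickled model expects inputs that were already scaled with the same `StandardScaler`, so the scaler must also be available wherever the model is loaded. One option is to bundle both steps in a scikit-learn `Pipeline` and pickle that single object. A minimal sketch on synthetic data (the data and variable names are assumptions, and it serializes in memory rather than to a file):

```python
import pickle
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic two-class data (assumption: stands in for the Titanic features)
rng = np.random.default_rng(2)
X_demo = np.vstack([rng.normal(0, 1, (40, 3)), rng.normal(3, 1, (40, 3))])
y_demo = np.array([0] * 40 + [1] * 40)

# Scaler and classifier travel together as one estimator
pipe = make_pipeline(StandardScaler(), SVC(C=1, kernel="rbf", gamma="scale"))
pipe.fit(X_demo, y_demo)

# Round-trip through pickle
restored = pickle.loads(pickle.dumps(pipe))

# The restored pipeline accepts raw (unscaled) rows directly
print(restored.predict(X_demo[:1]))
```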

## Evaluate Tuned Model

### Subtask:
Train the SVC model with the optimal hyperparameters found during tuning and evaluate its accuracy on both the training and test sets to see the improvement.


**Reasoning**:
To evaluate the performance of the `best_svc_model` on the training data, I will make predictions on the scaled training features and calculate the accuracy against the actual training labels. Then, I will print both the training and previously calculated test accuracies for comparison.



In [None]:
y_train_pred_best = best_svc_model.predict(x_train_scaled)
train_accuracy_best = accuracy_score(y_train, y_train_pred_best)
print(f"Training accuracy of the SVC model with best hyperparameters: {train_accuracy_best}")
print(f"Test accuracy of the SVC model with best hyperparameters: {test_accuracy_best}")

## Explore Other Models

### Subtask:
Explore other classification algorithms like Logistic Regression, RandomForestClassifier, or GradientBoostingClassifier. These models might capture different patterns in the data and yield better results.


**Reasoning**:
To explore other classification algorithms as requested, I will import `LogisticRegression`, `RandomForestClassifier`, and `GradientBoostingClassifier`, initialize them with specified parameters, train them on the scaled training data, make predictions on the scaled test data, and then calculate and print the accuracy score for each model.



In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Initialize models
logistic_model = LogisticRegression(solver='liblinear', random_state=2)
random_forest_model = RandomForestClassifier(random_state=2)
gradient_boosting_model = GradientBoostingClassifier(random_state=2)

# Train Logistic Regression model
logistic_model.fit(x_train_scaled, y_train)
# Make predictions and calculate accuracy for Logistic Regression
y_pred_logistic = logistic_model.predict(x_test_scaled)
accuracy_logistic = accuracy_score(y_test, y_pred_logistic)
print(f"Logistic Regression Accuracy on scaled test data: {accuracy_logistic}")

# Train Random Forest Classifier model
random_forest_model.fit(x_train_scaled, y_train)
# Make predictions and calculate accuracy for Random Forest Classifier
y_pred_rf = random_forest_model.predict(x_test_scaled)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Classifier Accuracy on scaled test data: {accuracy_rf}")

# Train Gradient Boosting Classifier model
gradient_boosting_model.fit(x_train_scaled, y_train)
# Make predictions and calculate accuracy for Gradient Boosting Classifier
y_pred_gb = gradient_boosting_model.predict(x_test_scaled)
accuracy_gb = accuracy_score(y_test, y_pred_gb)
print(f"Gradient Boosting Classifier Accuracy on scaled test data: {accuracy_gb}")
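
A single train/test split gives one accuracy figure per model, which can shift with the choice of `random_state`. Cross-validation averages over several splits and may give a steadier comparison. The following sketch uses synthetic data from `make_classification`; the dataset, the `models` dictionary, the reduced `n_estimators`, and the fold count are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic binary classification problem (assumption: not the Titanic data)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=2)

models = {
    "SVC": SVC(),
    "Logistic Regression": LogisticRegression(solver="liblinear", random_state=2),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=2),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=2),
}

cv_means = {}
for name, estimator in models.items():
    # Scaling sits inside the pipeline, so each fold fits its own scaler
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_means[name] = scores.mean()
    print(f"{name}: mean accuracy {scores.mean():.4f} (std {scores.std():.4f})")
```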

## Final Task

### Subtask:
Review the improvements in accuracy and discuss further steps based on the results.


## Summary:

### Q&A

**1. What are the improvements in accuracy after applying different modeling techniques?**
*   Initially, the existing SVC model, which had been trained on unscaled data, achieved an accuracy of 0.5587 on the scaled test data.
*   After retraining the SVC model on properly scaled training data, the accuracy improved to 0.7709.
*   Hyperparameter tuning of the SVC model using `GridSearchCV` left the test accuracy unchanged at 0.7709, with a training accuracy of 0.8202 and a best cross-validation accuracy of 0.8104.
*   Other classification models yielded the following test accuracies:
    *   Logistic Regression: 0.7654
    *   Random Forest Classifier: 0.7654
    *   Gradient Boosting Classifier: 0.7598

**2. What further steps should be taken based on the results?**
*   Consider implementing advanced ensemble techniques or exploring more complex neural network architectures if further performance gains are needed.
*   Investigate the model's errors (e.g., false positives, false negatives) to understand specific areas of weakness and guide further feature engineering or data collection efforts.
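
As a starting point for the error analysis suggested above, a confusion matrix separates false positives from false negatives. The labels and predictions in this sketch are made up for illustration; they are not results from the notebook.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels and predictions, for illustration only
y_true_demo = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred_demo = [0, 1, 1, 0, 1, 0, 1, 0]

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")  # TN=3, FP=1, FN=1, TP=3
```

In the notebook, the same call would take `y_test` and `y_pred_best` as its two arguments.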

### Data Analysis Key Findings

*   Scaling the data and retraining the Support Vector Classifier (SVC) model dramatically improved its test accuracy from 0.5587 to 0.7709.
*   Hyperparameter tuning of the SVC model using `GridSearchCV` identified `{'C': 1, 'gamma': 'scale', 'kernel': 'rbf'}` as the optimal parameters, achieving a best cross-validation accuracy of approximately 0.8104. These values match scikit-learn's `SVC` defaults, which is why the tuned model's test accuracy equals that of the retrained default model.
*   The tuned SVC model showed a training accuracy of 0.8202 and a test accuracy of 0.7709. The gap of about 0.05 suggests mild overfitting, though generalization remains reasonable.
*   When exploring other models:
    *   Logistic Regression achieved a test accuracy of 0.7654.
    *   Random Forest Classifier achieved a test accuracy of 0.7654.
    *   Gradient Boosting Classifier achieved a test accuracy of 0.7598.
*   The tuned SVC model scored slightly higher on the scaled test data than the other models (by about 0.006 to 0.011), a margin small enough that all four should be regarded as performing comparably.
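
To gauge whether gaps of this size are meaningful, a rough binomial standard error for a test accuracy can be computed. The sketch below assumes a test set of 179 rows, which is what a 20% split of an 891-row file would give; that count is an assumption, since the printed shapes are not shown here.

```python
import math

n_test = 179        # assumed test-set size (not confirmed by the notebook output)
accuracy = 0.7709   # tuned SVC test accuracy reported above

# Binomial standard error of a proportion: sqrt(p * (1 - p) / n)
std_err = math.sqrt(accuracy * (1 - accuracy) / n_test)
print(f"standard error ~ {std_err:.3f}")  # about 0.031
```

Under that assumption, the differences between the models' test accuracies fall well within one standard error.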

### Insights or Next Steps

*   The proper scaling of features and subsequent retraining of the model are crucial steps for optimal performance with algorithms like SVC.
*   While the tuned SVC model performed well, fine-tuning the hyperparameters for the other models (Logistic Regression, Random Forest, Gradient Boosting) could potentially lead to marginal improvements and should be considered for a more exhaustive comparison.
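
Tuning one of the other models could follow the same `GridSearchCV` pattern used for the SVC. A minimal sketch for Logistic Regression on synthetic data; the dataset, the name `logistic_grid`, and the candidate values of `C` are hypothetical choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the scaled training set (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=2)

# Hypothetical grid over the inverse regularization strength C
logistic_grid = {"C": [0.01, 0.1, 1, 10]}

logistic_search = GridSearchCV(
    estimator=LogisticRegression(solver="liblinear", random_state=2),
    param_grid=logistic_grid,
    cv=5,
    scoring="accuracy",
)
logistic_search.fit(X_demo, y_demo)

print(logistic_search.best_params_)
print(round(logistic_search.best_score_, 4))
```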
