## INSTRUCTIONS 

Every learner should submit their own homework solutions. You may discuss the homework with each other, but everyone must submit their own solution; you may not copy someone else’s solution.

The homework consists of two parts:
1.	Data from our life
2.	Classification

Follow the prompts in the attached Jupyter notebook. We are using the same data as in the previous homework assignments. Use the version you created called df2, in which you already cleaned the data and dropped some of the variables but did not create dummy variables. Instead of creating dummy variables, recode the fuel_type column as suggested below.
Add markdown cells to your analysis to include your solutions, comments, and answers. Add as many cells as you need, and comment your code where possible so it is easy to read.

**Note:** This homework has a bonus question, so the highest mark that can be earned is a 105.
Submission: Send in both an ipynb and a PDF file of your work.
Good luck!



# 1. Data from our lives:

### Describe a situation or problem from your job, everyday life, current events, etc., for which a classification would be appropriate.

## Your answer

Situation: Movie Genre Classification
Movies span a wide variety of genres: action, romance, comedy, thriller, and many more. Imagine a movie enthusiast who wants to build a system that automatically assigns movies to genres based solely on their plot summaries.
They gather a diverse dataset of plot summaries labeled with their genres. The text is cleaned and transformed into machine-readable features such as word frequencies and sentiment scores. A classifier is then trained on these features and evaluated on held-out summaries, so that it learns the patterns connecting features to genres. Once validated, the system predicts genres for new movie plots, which can support personalized recommendations and help users on streaming platforms discover films that match their preferences.
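
The workflow described above can be sketched in a few lines. This is only an illustration under stated assumptions: the plot summaries and labels are made up, the names `plots`, `genres`, and `genre_clf` are hypothetical, and TF-IDF with logistic regression is one possible choice of features and classifier, not the only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up plot summaries and genre labels, for illustration only.
plots = [
    "A retired agent fights a gang to rescue his kidnapped daughter",
    "Two strangers fall in love during a summer in Paris",
    "A soldier races to defuse a bomb before the city explodes",
    "A florist and a baker fall in love after a chance meeting",
]
genres = ["action", "romance", "action", "romance"]

# Word-frequency features (TF-IDF) feeding a simple linear classifier.
genre_clf = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(),
)
genre_clf.fit(plots, genres)

# Predict the genre of a new, unseen plot summary.
new_plot = ["A detective and a pianist fall in love in Vienna"]
predicted = genre_clf.predict(new_plot)
print(predicted[0])
```

A real system would need thousands of labeled summaries and a held-out test set before its predictions could be trusted.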

# 2. Preprocessing

In our class we covered multiple classification methods. In this part of the homework you can compare them.

**Use the dataset 'auto_imports1.csv' from our previous homeworks. More specifically, use the version you created called df2 where you already cleaned, dropped some of the variables but DID NOT CREATE dummy variables. Follow the prompts to complete the homework.**

In [51]:
from scipy import stats
from sklearn.linear_model import LinearRegression
from statsmodels.compat import lzip
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.outliers_influence import variance_inflation_factor
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm

%matplotlib inline

In [52]:
# Read in data
df = pd.read_csv('auto_imports1.csv')

df.head()

Unnamed: 0,fuel_type,body,wheel_base,length,width,heights,curb_weight,engine_type,cylinders,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,gas,convertible,88.6,168.8,64.1,48.8,2548,dohc,four,130,3.47,2.68,9.0,111,5000,21,27,13495
1,gas,convertible,88.6,168.8,64.1,48.8,2548,dohc,four,130,3.47,2.68,9.0,111,5000,21,27,16500
2,gas,hatchback,94.5,171.2,65.5,52.4,2823,ohcv,six,152,2.68,3.47,9.0,154,5000,19,26,16500
3,gas,sedan,99.8,176.6,66.2,54.3,2337,ohc,four,109,3.19,3.4,10.0,102,5500,24,30,13950
4,gas,sedan,99.4,176.6,66.4,54.3,2824,ohc,five,136,3.19,3.4,8.0,115,5500,18,22,17450


In [53]:
##your code here
# To Check the data types in the DataFrame
car_data_types = df.dtypes

car_data_types

fuel_type       object
body            object
wheel_base     float64
length         float64
width          float64
heights        float64
curb_weight      int64
engine_type     object
cylinders       object
engine_size      int64
bore            object
stroke          object
comprassion    float64
horse_power     object
peak_rpm        object
city_mpg         int64
highway_mpg      int64
price            int64
dtype: object

In [54]:
## Your code here

# Replace the '?' placeholders with NaN so the columns can be cast to float
df = df.replace('?', np.nan)

# Convert bore, stroke, horse_power, peak_rpm to float64
object_columns_to_float = ["bore", "stroke", "horse_power", "peak_rpm"]
df[object_columns_to_float] = df[object_columns_to_float].astype(float)

# Check whether any '?' values remain
if (df == '?').any().any():
    print("There are remaining '?' values in the DataFrame.")
else:
    print("There are no remaining '?' values in the DataFrame.")

There are no remaining '?' values in the DataFrame.
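
As an alternative to replacing the placeholder and then casting, `pd.to_numeric` with `errors="coerce"` does both in one step. The sketch below uses a small made-up frame (`toy` is hypothetical, not the homework data) to show the idea.

```python
import pandas as pd

# Hypothetical miniature of the raw data, with '?' marking missing values.
toy = pd.DataFrame({
    "bore": ["3.47", "?", "3.19"],
    "horse_power": ["111", "154", "?"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN.
numeric_cols = ["bore", "horse_power"]
toy[numeric_cols] = toy[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(toy.dtypes)
print(toy.isna().sum())
```

This variant also catches placeholders other than '?', which may or may not be desirable depending on how much you want to inspect the raw values first.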


In [55]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201 entries, 0 to 200
Data columns (total 18 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fuel_type    201 non-null    object 
 1   body         201 non-null    object 
 2   wheel_base   201 non-null    float64
 3   length       201 non-null    float64
 4   width        201 non-null    float64
 5   heights      201 non-null    float64
 6   curb_weight  201 non-null    int64  
 7   engine_type  201 non-null    object 
 8   cylinders    201 non-null    object 
 9   engine_size  201 non-null    int64  
 10  bore         197 non-null    float64
 11  stroke       197 non-null    float64
 12  comprassion  201 non-null    float64
 13  horse_power  199 non-null    float64
 14  peak_rpm     199 non-null    float64
 15  city_mpg     201 non-null    int64  
 16  highway_mpg  201 non-null    int64  
 17  price        201 non-null    int64  
dtypes: float64(9), int64(5), object(4)
memory usage: 2

In [56]:
## Your code here

# Drop the body, engine_type, and cylinders columns from the dataset
df.drop(columns=["body", "engine_type", "cylinders"], inplace=True)

# Copy the reduced DataFrame into df2
df2 = df.copy()



In [57]:
df2.head()

Unnamed: 0,fuel_type,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,gas,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,gas,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,gas,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450


In [58]:
## your code goes here

## Dropping rows with NaN values in the dataset
df2 = df2.dropna()

In [59]:
df2.isnull().sum()

fuel_type      0
wheel_base     0
length         0
width          0
heights        0
curb_weight    0
engine_size    0
bore           0
stroke         0
comprassion    0
horse_power    0
peak_rpm       0
city_mpg       0
highway_mpg    0
price          0
dtype: int64

In [60]:
# ## Your code goes here


# # Creating dummy variables for fuel_type
# dummy_fuel_type = pd.get_dummies(df2['fuel_type'], prefix='fuel_type')

# # Dropping the first level of dummy variable
# dummy_fuel_type = dummy_fuel_type.iloc[:, 1:]

# # Replacing the original 'fuel_type' column with the dummy variables
# df2 = pd.concat([df2, dummy_fuel_type], axis=1)

# # Dropping the original 'fuel_type' column
# df2 = df2.drop(columns='fuel_type')


In [61]:
df2.info()

<class 'pandas.core.frame.DataFrame'>
Index: 195 entries, 0 to 200
Data columns (total 15 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   fuel_type    195 non-null    object 
 1   wheel_base   195 non-null    float64
 2   length       195 non-null    float64
 3   width        195 non-null    float64
 4   heights      195 non-null    float64
 5   curb_weight  195 non-null    int64  
 6   engine_size  195 non-null    int64  
 7   bore         195 non-null    float64
 8   stroke       195 non-null    float64
 9   comprassion  195 non-null    float64
 10  horse_power  195 non-null    float64
 11  peak_rpm     195 non-null    float64
 12  city_mpg     195 non-null    int64  
 13  highway_mpg  195 non-null    int64  
 14  price        195 non-null    int64  
dtypes: float64(9), int64(5), object(1)
memory usage: 24.4+ KB


In [62]:
df2.head()

Unnamed: 0,fuel_type,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,gas,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,gas,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,gas,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,gas,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450


## 2.1 **Replace ['gas', 'diesel'] string values to [0, 1]**

In [63]:
#Your code
# Assuming 'fuel_type' is the first column (index 0)
dict_replace = {'gas': 0, 'diesel': 1}
df2.iloc[:, 0] = df2.iloc[:, 0].replace(dict_replace)
df2.head()

Unnamed: 0,fuel_type,wheel_base,length,width,heights,curb_weight,engine_size,bore,stroke,comprassion,horse_power,peak_rpm,city_mpg,highway_mpg,price
0,0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,13495
1,0,88.6,168.8,64.1,48.8,2548,130,3.47,2.68,9.0,111.0,5000.0,21,27,16500
2,0,94.5,171.2,65.5,52.4,2823,152,2.68,3.47,9.0,154.0,5000.0,19,26,16500
3,0,99.8,176.6,66.2,54.3,2337,109,3.19,3.4,10.0,102.0,5500.0,24,30,13950
4,0,99.4,176.6,66.4,54.3,2824,136,3.19,3.4,8.0,115.0,5500.0,18,22,17450
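
One thing to check after this recoding is the dtype of the column: depending on the pandas version, assigning through `.iloc` may leave `fuel_type` stored as `object` even though the values are now 0 and 1. A minimal sketch of a recoding that ends with an integer column, using a made-up frame (`toy` is hypothetical):

```python
import pandas as pd

# Hypothetical stand-in for the fuel_type column of df2.
toy = pd.DataFrame({"fuel_type": ["gas", "diesel", "gas", "gas"]})

# Series.map builds a new Series; assigning it by column name replaces
# the old object column, and astype(int) makes the dtype explicit.
toy["fuel_type"] = toy["fuel_type"].map({"gas": 0, "diesel": 1}).astype(int)

print(toy["fuel_type"].dtype)
print(toy["fuel_type"].tolist())
```

With `map`, any value missing from the dictionary becomes NaN, so the `astype(int)` call fails loudly if an unexpected category is present.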


## 2.2 Define your X and y: your dependent variable is fuel_type, the rest of the variables are your independent variables

In [64]:
#your code
X = df2.drop(columns='fuel_type')  # Independent variables (everything except fuel_type)
y = df2['fuel_type']  # Dependent variable (fuel_type)


## 2.3 Split your data into training and testing sets. Use test_size=0.3, random_state=746!

In [65]:
#your code
from sklearn.model_selection import train_test_split

# Splitting the data into training and testing sets
X_train_new, X_test_new, y_train_new, y_test_new = train_test_split(X, y, test_size=0.3, random_state=746)

# Display the shapes of the resulting sets
print("X_train_new shape:", X_train_new.shape)
print("X_test_new shape:", X_test_new.shape)
print("y_train_new shape:", y_train_new.shape)
print("y_test_new shape:", y_test_new.shape)



X_train_new shape: (136, 14)
X_test_new shape: (59, 14)
y_train_new shape: (136,)
y_test_new shape: (59,)
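
Because diesel cars are a small minority, a plain random split can give the training and test sets different class proportions. The sketch below shows the `stratify` argument of `train_test_split` on synthetic data; the arrays `X_toy` and `y_toy` and the 20-versus-175 class counts are made up for illustration. The homework prompt does not ask for stratification, so treat this as an optional variant.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 195 rows, 20 positives (hypothetical numbers).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(195, 3))
y_toy = np.array([1] * 20 + [0] * 175)

# stratify=y_toy keeps the share of positives nearly equal in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=746, stratify=y_toy
)

print("positive share, train:", y_tr.mean())
print("positive share, test: ", y_te.mean())
```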


# 3. Classification

### 3.1 Use Logistic regression to classify your data. Print/report your confusion matrix, classification report and AUC

In [66]:
#your code
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Assuming you have already split your data into X_train_new, X_test_new, y_train_new, y_test_new

# Label encoding for the target variable if it's categorical
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_new)
y_test_encoded = label_encoder.transform(y_test_new)

# Creating a pipeline for Logistic Regression
numeric_features = X_train_new.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])

# Logistic Regression model within a pipeline
logreg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=10000))
])

# Fitting the pipeline
logreg_pipeline.fit(X_train_new, y_train_encoded)

# Predictions
y_pred = logreg_pipeline.predict(X_test_new)

# Confusion matrix
conf_matrix = confusion_matrix(y_test_encoded, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_test_encoded, y_pred)
print("\nClassification Report:")
print(class_report)

# Calculating AUC
auc = roc_auc_score(y_test_encoded, y_pred)
print("\nAUC Score:", auc)


Confusion Matrix:
[[50  0]
 [ 0  9]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00         9

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59


AUC Score: 1.0
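
The AUC above is computed from the predicted class labels, so it summarizes a single decision threshold. `roc_auc_score` can also be given the predicted probability of class 1, which uses the full ranking of the test cases. A minimal sketch on synthetic data (everything named `*_toy`, `model`, `auc_labels`, and `auc_scores` is hypothetical and not part of the homework data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the car data.
X_toy, y_toy = make_classification(
    n_samples=300, n_features=6, weights=[0.85, 0.15], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0, stratify=y_toy
)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)

# AUC from hard 0/1 predictions: only one point on the ROC curve is used.
auc_labels = roc_auc_score(y_te, model.predict(X_te))

# AUC from the predicted probability of class 1: uses the full ranking.
auc_scores = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print("AUC from labels:       ", auc_labels)
print("AUC from probabilities:", auc_scores)
```

In this notebook the same idea would apply to each pipeline by passing `predict_proba(X_test_new)[:, 1]` instead of `y_pred`; the reported numbers could then differ from the ones shown here.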


### 3.2 Use Naive Bayes to classify your data. Print/report your confusion matrix, classification report and AUC

In [67]:
#your code
from sklearn.naive_bayes import GaussianNB

# Label encoding for the target variable if it's categorical
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_new)
y_test_encoded = label_encoder.transform(y_test_new)

# Creating a pipeline for Naive Bayes
numeric_features = X_train_new.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])

# Naive Bayes model within a pipeline
naive_bayes_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', GaussianNB())
])

# Fitting the pipeline
naive_bayes_pipeline.fit(X_train_new, y_train_encoded)

# Predictions
y_pred = naive_bayes_pipeline.predict(X_test_new)

# Confusion matrix
conf_matrix = confusion_matrix(y_test_encoded, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_test_encoded, y_pred)
print("\nClassification Report:")
print(class_report)

# Calculating AUC
auc = roc_auc_score(y_test_encoded, y_pred)
print("\nAUC Score:", auc)


Confusion Matrix:
[[50  0]
 [ 0  9]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00         9

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59


AUC Score: 1.0


### 3.3 Use KNN to classify your data. First find the optimal k and then run your classification. Print/report your confusion matrix, classification report and AUC

In [68]:
#your code
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Label encoding for the target variable if it's categorical
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_new)
y_test_encoded = label_encoder.transform(y_test_new)

# Creating a pipeline for KNN
numeric_features = X_train_new.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])

# KNN model within a pipeline
knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier())
])

# Define the parameter grid
param_grid = {'classifier__n_neighbors': range(1, 21)}  # Trying K values from 1 to 20

# Perform GridSearchCV to find the best K
grid_search = GridSearchCV(knn_pipeline, param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_new, y_train_encoded)

# Get the best K value
best_k = grid_search.best_params_['classifier__n_neighbors']
print("Best K value:", best_k)

# Use the best K to classify the data
best_knn_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', KNeighborsClassifier(n_neighbors=best_k))
])

best_knn_pipeline.fit(X_train_new, y_train_encoded)

# Predictions
y_pred = best_knn_pipeline.predict(X_test_new)

# Confusion matrix
conf_matrix = confusion_matrix(y_test_encoded, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_test_encoded, y_pred)
print("\nClassification Report:")
print(class_report)

# Calculating AUC
auc = roc_auc_score(y_test_encoded, y_pred)
print("\nAUC Score:", auc)


Best K value: 1
Confusion Matrix:
[[50  0]
 [ 0  9]]

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        50
           1       1.00      1.00      1.00         9

    accuracy                           1.00        59
   macro avg       1.00      1.00      1.00        59
weighted avg       1.00      1.00      1.00        59


AUC Score: 1.0
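
With its default `refit=True`, `GridSearchCV` already refits the best pipeline on the full training data, so the fitted model is available as `best_estimator_` without building a second pipeline. The sketch below shows this on synthetic data, together with the mean cross-validated accuracy for each k; the names `X_toy`, `y_toy`, `knn_pipe`, and `search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the homework dataset).
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

knn_pipe = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("classifier", KNeighborsClassifier()),
])
search = GridSearchCV(
    knn_pipe,
    {"classifier__n_neighbors": range(1, 21)},
    cv=5,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# best_estimator_ is the pipeline refitted on all data given to fit(),
# using the k with the highest mean cross-validated accuracy.
best_model = search.best_estimator_
best_k = search.best_params_["classifier__n_neighbors"]
mean_scores = search.cv_results_["mean_test_score"]

print("best k:", best_k)
print("mean CV accuracy per k:", mean_scores.round(3))
```

Looking at the whole `mean_test_score` array shows whether the best k wins clearly or whether several values of k perform about equally well.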


### 3.4 Choose one: SVM or Random Forest to classify your data. Print/report your confusion matrix, classification report and AUC

In [69]:
#your code
from sklearn.ensemble import RandomForestClassifier

# Label encoding for the target variable if it's categorical
label_encoder = LabelEncoder()
y_train_encoded = label_encoder.fit_transform(y_train_new)
y_test_encoded = label_encoder.transform(y_test_new)

# Creating a pipeline for Random Forest
numeric_features = X_train_new.select_dtypes(include=['float64', 'int64']).columns.tolist()
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features)])

# Random Forest model within a pipeline
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100, random_state=42))  # Adjust parameters as needed
])

# Fitting the pipeline
rf_pipeline.fit(X_train_new, y_train_encoded)

# Predictions
y_pred = rf_pipeline.predict(X_test_new)

# Confusion matrix
conf_matrix = confusion_matrix(y_test_encoded, y_pred)
print("Confusion Matrix:")
print(conf_matrix)

# Classification report
class_report = classification_report(y_test_encoded, y_pred)
print("\nClassification Report:")
print(class_report)

# Calculating AUC
auc = roc_auc_score(y_test_encoded, y_pred)
print("\nAUC Score:", auc)


Confusion Matrix:
[[50  0]
 [ 1  8]]

Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99        50
           1       1.00      0.89      0.94         9

    accuracy                           0.98        59
   macro avg       0.99      0.94      0.97        59
weighted avg       0.98      0.98      0.98        59


AUC Score: 0.9444444444444444


### 3.5 Compare your results and comment on your findings. Which one(s) did the best job? What could have been the problem with the ones that did not work? etc.

#your answer
- **Logistic Regression (3.1):**
  - **Pros:** Perfect precision, recall, and f1-score for both classes and an AUC of 1.0: all 59 test cars were classified correctly.
  - **Cons:** None visible in this output.

- **Naive Bayes (3.2):**
  - **Pros:** Also perfect on every metric, with an AUC of 1.0.
  - **Cons:** None visible in this output.

- **KNN (3.3):**
  - **Pros:** Perfect on every metric with the cross-validated k = 1, and an AUC of 1.0.
  - **Cons:** None visible in this output, although k = 1 relies on a single nearest neighbor and can be sensitive to noisy points.

- **Random Forest (3.4):**
  - **Pros:** 98% accuracy, with precision of 0.98 for class 0 and 1.00 for class 1.
  - **Cons:** Recall for class 1 (diesel) dropped to 0.89: one of the 9 diesel cars was predicted as gas, which lowered the AUC to 0.944.

### Comparison and Observations:
- **Top Performers:** Logistic Regression, Naive Bayes, and KNN classified all 59 test cars correctly.
- **Minor Variance:** Random Forest made a single error, a false negative on the minority class (diesel).
- **Insights:** The test set contains only 9 diesel cars against 50 gas cars, so one misclassification moves class 1 recall by about 11 percentage points. The gap between Random Forest and the other models is therefore a single observation, not evidence of a systematic difference.

### Analysis:
- **Strengths:** All four models separate gas from diesel almost perfectly, which suggests the two classes are easy to tell apart from the available features.
- **Potential Weakness:** The data are imbalanced, and the only error fell on the minority class. The Random Forest was also run with untuned parameters. In addition, the AUC values were computed from the predicted labels, not from predicted probabilities, so they reflect a single decision threshold.

In conclusion, all four models performed very well on this dataset. Logistic Regression, Naive Bayes, and KNN classified every test car correctly, while Random Forest missed one diesel car. With so few diesel cars in the test set, these results show that the task is easy for all four methods more than they show that one method is clearly better.
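
Since the test set holds only 9 diesel cars, a single split gives a noisy comparison. One way to get a steadier picture is stratified k-fold cross-validation, sketched below on synthetic data. The data, the `models` dictionary, and the choice of 5 folds are assumptions for illustration and do not reproduce the homework results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (not the homework dataset).
X_toy, y_toy = make_classification(
    n_samples=195, n_features=8, weights=[0.85, 0.15], random_state=0
)

models = {
    "logreg": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "naive_bayes": make_pipeline(StandardScaler(), GaussianNB()),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

# Stratified folds keep the minority share similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="roc_auc")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: AUC {scores.mean():.3f} +/- {scores.std():.3f}")
```

Comparing the mean and spread across folds makes it easier to judge whether a one-car difference between models is meaningful.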


## 4. Bonus question (5 extra points)
**Try to fix the imbalanced nature of the data with a tool from the lecture. Run one of the classification methods (preferably one that "failed" before) and see if you get better results.**

In [70]:
#your code
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
# Apply SMOTE only to the training set to avoid data leakage
smote = SMOTE(random_state=42)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_new, y_train_new)

# Initialize and fit the Random Forest classifier
rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
rf_classifier.fit(X_train_resampled, y_train_resampled)

# Predictions on the test set
y_pred_test = rf_classifier.predict(X_test_new)

# Confusion matrix
confusion_matrix_test = confusion_matrix(y_test_new, y_pred_test)
print("Confusion Matrix (Test Set):")
print(confusion_matrix_test)

# Classification report
classification_report_test = classification_report(y_test_new, y_pred_test)
print("\nClassification Report (Test Set):")
print(classification_report_test)

# Calculating AUC
auc_score_test = roc_auc_score(y_test_new, y_pred_test)
print("\nAUC Score (Test Set):", auc_score_test)


ValueError: Unknown label type: unknown. Maybe you are trying to fit a classifier, which expects discrete classes on a regression target with continuous values.
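
A likely cause of this error is that `y_train_new` still has `object` dtype after the recoding in 2.1: the values are 0 and 1, but scikit-learn cannot infer a label type from a generic object array. The earlier cells avoided this by passing the target through `LabelEncoder`, which this cell skips. The sketch below reproduces the situation on a made-up Series (`y_obj` is a hypothetical stand-in for `y_train_new`) and shows that casting to `int` gives a recognized target type.

```python
import pandas as pd
from sklearn.utils.multiclass import type_of_target

# Hypothetical stand-in for y_train_new: 0/1 values stored with object dtype.
y_obj = pd.Series([0, 1, 0, 0, 1], dtype=object)
target_type_before = type_of_target(y_obj)

# Casting to int gives scikit-learn a numeric label array it can recognize.
y_int = y_obj.astype(int)
target_type_after = type_of_target(y_int)

print(y_obj.dtype, "->", target_type_before)
print(y_int.dtype, "->", target_type_after)
```

If this is indeed the cause, passing `y_train_new.astype(int)` to `smote.fit_resample` and `y_test_new.astype(int)` to the metric functions, or reusing `y_train_encoded` and `y_test_encoded` from the earlier cells, should let the bonus cell run.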