## <center> Assignment 2 </center>

#### Name: Tamal Chakroborty
#### Student ID:245830440

You are provided with a training dataset and a testing dataset for a binary classification problem with labels {0,1}. The last column of the training set is the label, while the test dataset contains only attributes.

Train an effective classifier using the training dataset. You are free to choose your data processing approach, the classifier type, and tune the classifier's parameters as needed. You can use the sklearn package in Python for model implementation. 

Make predictions on the testing dataset and generate a file containing only one column of labels (predicted 0 or 1), in the same order as the testing dataset.

Please submit your implementation code and the predicted output file as two separate files (not in a zip) in the names "A2.ipynb" and "prediction.txt". Your assignment will be evaluated based on the performance of your model, specifically its F1-score, among other criteria.

In [456]:
import pandas as pd
from sklearn.preprocessing import RobustScaler
from sklearn.metrics import f1_score, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE


df_train = pd.read_csv('A2_data/train.csv',sep=',',index_col=0) 
df_test_attribute_only = pd.read_csv('A2_data/test_attribute.csv',sep=',',index_col=0) 

In [458]:
## your code here
df_train.sample(5)

Unnamed: 0,0,1,2,3,4,5,6,7,8
233,0.28,0.48,0.36,0.16,0.5,0.0,0.53,0.22,0
537,0.53,0.53,0.6,0.13,0.5,0.0,0.49,0.22,0
387,0.35,0.21,1.0,0.8,0.5,0.0,0.13,0.01,0
320,0.19,0.41,0.55,0.13,0.5,0.0,0.52,0.25,0
169,0.45,0.4,0.5,0.16,0.5,0.0,0.5,0.22,0


## To know the details of the data, we will check if there are any missing values, the range of the data, insignificant columns, and class blance

In [460]:
missing_values = {}
ranges = {}
outlier_info = {}
frequency_counts = {}

# Loop through each column
for column in df_train.columns:
    # Step 1: Count missing values
    missing_values[column] = df_train[column].isnull().sum()
    
    # Step 2: Calculate range (max - min)
    ranges[column] = df_train[column].max() - df_train[column].min()
    
    # Step 3: Outliers detection using IQR
    Q1 = df_train[column].quantile(0.25)
    Q3 = df_train[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = df_train[(df_train[column] < lower_bound) | (df_train[column] > upper_bound)]
    outlier_info[column] = len(outliers)
    frequency_counts[column] = df_train[column].value_counts()


# Display the results
print("Missing Values in Each Column:\n", missing_values)
print("\nRange of Each Column:\n", ranges)
print("\nNumber of Outliers in Each Column:\n", outlier_info)
print("\nFrequency of Unique Values in Each Column:")

# Print frequency counts for each column
for column, counts in frequency_counts.items():
    print(f"\nFrequency counts for column '{column}':\n{counts}")

Missing Values in Each Column:
 {'0': 0, '1': 0, '2': 0, '3': 0, '4': 0, '5': 0, '6': 0, '7': 0, '8': 0}

Range of Each Column:
 {'0': 0.73, '1': 0.86, '2': 0.79, '3': 0.8, '4': 0.5, '5': 0.83, '6': 0.59, '7': 0.73, '8': 1}

Number of Outliers in Each Column:
 {'0': 10, '1': 20, '2': 11, '3': 25, '4': 6, '5': 8, '6': 25, '7': 86, '8': 64}

Frequency of Unique Values in Each Column:

Frequency counts for column '0':
0
0.51    32
0.46    31
0.45    29
0.50    25
0.47    23
        ..
0.87     1
0.83     1
0.88     1
0.29     1
0.24     1
Name: count, Length: 70, dtype: int64

Frequency counts for column '1':
1
0.46    32
0.45    32
0.48    28
0.53    24
0.51    23
        ..
0.78     1
0.14     1
0.18     1
1.00     1
0.80     1
Name: count, Length: 68, dtype: int64

Frequency counts for column '2':
2
0.54    42
0.53    40
0.51    39
0.52    37
0.50    35
0.55    31
0.56    29
0.49    29
0.48    26
0.47    25
0.45    24
0.57    23
0.46    22
0.58    20
0.43    15
0.35    14
0.34    14
0.

#### From the data it is evident that there are outliers in the data and the classes are imbalance. Additionally, column 4 insignificant.

**To address outliers:** the tree based model(e.g., Random Forests, Decision Trees) can be better than distance based model for the data and we can use some approaches(e.g. Robust Scaler) to handle the outliers.

**To address class imbalance:** We have to use class weightining factors in the model and if it does not perform well enough we need to resample to balance the classes.

**Insignificant Columns:** Since column 4 is statistically insignificant, we can check the accuracy removing it and keeping it.

# Data preprocessing by removing columns and scaling

In [471]:
scaler = RobustScaler()
class_labels = df_train.iloc[:, -1]


## Removing column
filtered_df_train = df_train.drop(df_train.columns[[4]], axis=1)
filtered_features = filtered_df_train.iloc[:, :-1]
class_labels = filtered_df_train.iloc[:, -1]

filtered_X_train, filtered_X_test, filtered_y_train, filtered_y_test = train_test_split(filtered_features, class_labels, test_size=0.2, random_state=42)

# Fit on the training data
filtered_X_train_scaled = pd.DataFrame(scaler.fit_transform(filtered_X_train), columns=filtered_X_train.columns)
filtered_X_test_scaled = pd.DataFrame(scaler.transform(filtered_X_test), columns=filtered_X_test.columns)




## without Removing column
features = df_train.iloc[:, :-1]

scaler = RobustScaler()

X_train, X_test, y_train, y_test = train_test_split(features, class_labels, test_size=0.2, random_state=42)

# Fit on the training data
X_train_scaled = pd.DataFrame(scaler.fit_transform(X_train), columns=X_train.columns)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)

## We will test with Random forest and Logistic regression

In [489]:
model_rf = RandomForestClassifier(class_weight='balanced',random_state=42)
model_lr = LogisticRegression(class_weight='balanced', random_state=42)


# For filtered scaled data

# Random forest
model_rf.fit(filtered_X_train_scaled, filtered_y_train)
filtered_y_pred = model_rf.predict(filtered_X_test_scaled)
filtered_f1_RF = f1_score(filtered_y_test, filtered_y_pred, average='weighted')
print("Filtered scaled data -> F1-score for random forest:", filtered_f1_RF)

# Logistic regression
model_lr.fit(filtered_X_train_scaled, filtered_y_train)
filtered_y_pred = model_lr.predict(filtered_X_test_scaled)
filtered_f1_LR = f1_score(filtered_y_test, filtered_y_pred, average='weighted')
print("Filtered scaled data -> F1-score for logistic regression:", filtered_f1_LR)




# For unfiltered scaled data

# Random forest
model_rf.fit(X_train_scaled, y_train)
y_pred = model_rf.predict(X_test_scaled)
f1_RF = f1_score(y_test, y_pred, average='weighted')
print("Unfiltered scaled data -> F1-score for random forest:", f1)

# Logistic regression
model_lr.fit(X_train_scaled, y_train)
y_pred = model_lr.predict(X_test_scaled)
f1_LR = f1_score(y_test, y_pred, average='weighted')
print("Unfiltered scaled data -> F1-score for logistic regression:", f1)

Filtered scaled data -> F1-score for random forest: 0.961255898604444
Filtered scaled data -> F1-score for logistic regression: 0.9240248677485524
Unfiltered scaled data -> F1-score for random forest: 0.9694656488549618
Unfiltered scaled data -> F1-score for logistic regression: 0.9694656488549618


### It seems the filtering(removing insignificant column) has no effect or adverse effect on the classification. Additionally,  random forest is performing better in both of the cases.

**We can select the Random forest for prediction. Since the accuracy is not very good we can check with the SMOTE(Synthetic Minority Oversampling) to balance out the imbalance data** 

In [491]:
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train_scaled, y_train)

# Train on the resampled data
model_rf.fit(X_resampled, y_resampled)

# Make predictions
y_pred = model_rf.predict(X_test_scaled)

# Calculate F1-score
f1 = f1_score(y_test, y_pred, average='weighted')
print('Classification detais report of Random forest on unfiltered data with SMOTE \n', classification_report(y_test, y_pred))

Classification detais report of Random forest on unfiltered data with SMOTE 
               precision    recall  f1-score   support

           0       0.98      0.98      0.98       116
           1       0.87      0.87      0.87        15

    accuracy                           0.97       131
   macro avg       0.92      0.92      0.92       131
weighted avg       0.97      0.97      0.97       131



### Since the accuracy got improved, we can generate prodiction  for the given test data.

In [496]:
df_test_attribute_only_scaled = pd.DataFrame(scaler.transform(df_test_attribute_only), columns=df_test_attribute_only.columns)


test_attribute_prediction = model_rf.predict(df_test_attribute_only_scaled)

pd.DataFrame(test_attribute_prediction, columns=['Prediction']).to_csv("prediction.txt", header= True, index= False)

### Briefly describe your approach in the following cell.

#### First, we analyzed the data and considered attribute selection and preprocessing stretegies as explained above. Additionally, based on the data we decided the suitable classifiers and sampling requirements. Each of the steps have been decribed with reasoning above.