## Predict the target

#### **Problem Statement:**
An airline has launched a new medical kit that customers can buy during check-in. After the launch, the company collected data from every passenger who bought the kit, and it now wants to analyze how useful the kit is.

#### **Task:**
Your task is to build a model that predicts the usefulness of the medical kit. 

**Note:** `0` represents that the kit is ***useful*** and `1` represents that the kit is ***not useful***.

**Dataset description:** The data folder contains two **.csv** files, **train.csv** (***6736 x 10***) and **test.csv** (***2164 x 9***).

The dataset consists of the following columns:

* `ID` -> Represents a unique identifier
* `Distributor` -> Represents the distributor's code
* `Product` -> Represents a product's code
* `Duration` -> Represents the time taken to reach a destination
* `Destination` -> Represents a destination's code
* `Sales` -> Represents a sale price
* `Commission` -> Represents a commission charged by the distributor
* `Gender` -> Represents the gender of a passenger
* `Age` -> Represents the age of a passenger
* `Target` -> Represents the target column {**0**:`'Useful'`, **1**:`'Not useful'`}

**Evaluation metric:**
`score = 100*metrics.f1_score(actual, predicted, average = "weighted")`
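
As a quick illustration of the metric, the sketch below computes the score on a small made-up pair of label lists. The values of `actual` and `predicted` are hypothetical and are not drawn from the dataset. With `average="weighted"`, the per-class F1 scores are averaged with weights proportional to how often each class appears in `actual`.

```python
from sklearn import metrics

# Hypothetical labels for illustration only: 0 = useful, 1 = not useful
actual = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 1]

# F1 score of each class separately (class 0 first, then class 1)
per_class_f1 = metrics.f1_score(actual, predicted, average=None)

# Support-weighted average of the per-class scores, scaled to 0-100
score = 100 * metrics.f1_score(actual, predicted, average="weighted")

print(per_class_f1)
print(score)
```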

Write a program that generates predictions for the data in **test.csv** and writes them to a .csv file. The index is `"ID"` and the target is the `"Target"` column.

Ensure that the file contains the correct names of columns as provided in the **sample_submission.csv** file. Ensure that the file contains the correct index values as per the **test.csv** file. The size of the submission file must be ***2164 x 2***.
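
The requirements above can be checked before uploading. The following is a minimal sketch, not part of the original task: `check_submission` is a hypothetical helper, and it assumes the expected column names are `ID` and `Target`. It is demonstrated on a toy frame with three rows rather than the real 2164.

```python
import pandas as pd


def check_submission(submission, test_df, expected_rows=2164):
    """Hypothetical helper: raise AssertionError if the submission layout is wrong."""
    # Column names are assumed to be "ID" and "Target"
    assert list(submission.columns) == ["ID", "Target"], "unexpected column names"
    assert submission.shape == (expected_rows, 2), "unexpected shape"
    # IDs must appear in the same order as in the test data
    assert submission["ID"].tolist() == test_df["ID"].tolist(), "IDs do not match test data"
    assert set(submission["Target"].unique()) <= {0, 1}, "Target must be 0 or 1"
    return True


# Toy stand-ins for test.csv and a submission (three rows instead of 2164)
toy_test = pd.DataFrame({"ID": ["a", "b", "c"], "Duration": [22, 26, 15]})
toy_submission = pd.DataFrame({"ID": toy_test["ID"], "Target": [0, 1, 0]})

ok = check_submission(toy_submission, toy_test, expected_rows=3)
print(ok)
```

For the real files, the same call would take the submission and test frames with the default `expected_rows=2164`.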

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

### Load the datasets

In [2]:
# Load the datasets
train_df = pd.read_csv("C:\\Users\\Vaishob\\Desktop\\data\\train.csv")
test_df = pd.read_csv("C:\\Users\\Vaishob\\Desktop\\data\\test.csv")
sample_submission = pd.read_csv("C:\\Users\\Vaishob\\Desktop\\data\\sample_submission.csv")

### Initial Exploration

In [4]:
# Initial exploration
print(train_df.head())

                         ID  Distributor  Product  Duration  Destination  \
0      fffe3800370038003900            7        1        22          122   
1  fffe34003200370037003500            7        1        26           52   
2  fffe32003100320030003200            7       10        15           83   
3  fffe34003400310037003000            8       25        24           55   
4  fffe32003400390038003000            6       16        12          122   

   Sales  Commission  Gender  Age  Target  
0   31.0        0.00     NaN   20       0  
1   22.0        0.00     NaN   36       0  
2   63.0        0.00     NaN   34       0  
3   62.0       24.80     0.0  118       0  
4   19.8       11.88     NaN   26       0  


In [5]:
print(train_df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6736 entries, 0 to 6735
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   ID           6736 non-null   object 
 1   Distributor  6736 non-null   int64  
 2   Product      6736 non-null   int64  
 3   Duration     6736 non-null   int64  
 4   Destination  6736 non-null   int64  
 5   Sales        6736 non-null   float64
 6   Commission   6736 non-null   float64
 7   Gender       2032 non-null   float64
 8   Age          6736 non-null   int64  
 9   Target       6736 non-null   int64  
dtypes: float64(3), int64(6), object(1)
memory usage: 526.4+ KB
None
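
The `info()` output shows that `Gender` has only 2032 non-null values out of 6736, so roughly 70% of that column is missing. A minimal sketch for surfacing the share of missing values per column is shown below. It uses a small toy frame (`toy` is a made-up stand-in, not the real data); the same expression could be applied to `train_df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df: Gender is mostly missing, Age is complete
toy = pd.DataFrame({
    "Gender": [np.nan, np.nan, np.nan, 0.0],
    "Age": [20, 36, 34, 118],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```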


### Preprocessing
* Separate numeric & non-numeric columns
* Remove 'Target' from numeric columns since it's not in the test set
* For numeric columns, fill missing values with median
* For non-numeric columns, fill missing values with mode 

In [10]:
# Preprocessing

# Separate numeric and non-numeric columns
numeric_cols = train_df.select_dtypes(include=['number']).columns
non_numeric_cols = train_df.select_dtypes(exclude=['number']).columns

# Remove 'Target' from numeric columns since it's not in the test set
numeric_cols = numeric_cols.drop('Target')

# Fill missing values for numeric columns with the median
train_df[numeric_cols] = train_df[numeric_cols].fillna(train_df[numeric_cols].median())
# Use the training medians for the test set so both sets are imputed consistently
test_df[numeric_cols] = test_df[numeric_cols].fillna(train_df[numeric_cols].median())

# Fill missing values for non-numeric columns with the mode
train_df[non_numeric_cols] = train_df[non_numeric_cols].fillna(train_df[non_numeric_cols].mode().iloc[0])
test_df[non_numeric_cols] = test_df[non_numeric_cols].fillna(train_df[non_numeric_cols].mode().iloc[0])

In [11]:
# Handle categorical features with LabelEncoder
label_encoders = {}
for column in ['Distributor', 'Product', 'Destination', 'Gender']:
    le = LabelEncoder()
    train_df[column] = le.fit_transform(train_df[column])
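    # Map each test value to its encoded label; values unseen in training become -1
    mapping = {cls: idx for idx, cls in enumerate(le.classes_)}
    test_df[column] = test_df[column].map(mapping).fillna(-1).astype(int)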
    label_encoders[column] = le
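
As an alternative to handling unseen categories by hand, scikit-learn's `OrdinalEncoder` (version 0.24 or later) can assign a fixed code to values that were not present during fitting. The sketch below is an illustration on made-up frames (`toy_train` and `toy_test` are hypothetical), not the approach used in this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data: product code 99 appears only in the test frame
toy_train = pd.DataFrame({"Product": [1, 10, 25, 10]})
toy_test = pd.DataFrame({"Product": [10, 99]})

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",
    unknown_value=-1,
    dtype=np.int64,
)
train_codes = encoder.fit_transform(toy_train[["Product"]])
test_codes = encoder.transform(toy_test[["Product"]])

print(train_codes.ravel())
print(test_codes.ravel())
```

Unlike `LabelEncoder`, which is intended for target labels, `OrdinalEncoder` works on feature columns and can encode several columns in one call.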

### Model Training
* Separate features and target
* Split the training data into training and validation sets
* Train a random forest classifier
* Evaluate on the validation set using the weighted F1 score
* Retrain on the full dataset and predict on the test set

In [12]:
# Separate features and target
X = train_df.drop(columns=['ID', 'Target'])
y = train_df['Target']
X_test = test_df.drop(columns=['ID'])

# Train-test split for model evaluation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Model training
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Evaluate on validation set
val_predictions = model.predict(X_val)
f1 = f1_score(y_val, val_predictions, average='weighted')
print(f"Validation F1 Score: {f1}")

# Train on the full dataset and predict on test set
model.fit(X, y)
test_predictions = model.predict(X_test)

# Prepare the submission file
submission = pd.DataFrame({
    'ID': test_df['ID'],
    'Target': test_predictions
})

# Save the submission file
submission.to_csv("C:\\Users\\Vaishob\\Desktop\\data\\submission.csv", index=False)

Validation F1 Score: 0.928110934104848
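
A single 80/20 split gives only one estimate of the score. A hedged sketch of a complementary check is stratified k-fold cross-validation with the same weighted F1 metric. The example below runs on synthetic data from `make_classification` (the sample size, feature count, and class weights are arbitrary assumptions), so its numbers say nothing about the medical kit dataset; with the real data, `X` and `y` would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X and y; the class imbalance is an arbitrary assumption
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.85, 0.15], random_state=42
)

# Stratified folds keep the class ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo,
    y_demo,
    cv=cv,
    scoring="f1_weighted",
)

print(f"Mean weighted F1: {scores.mean():.3f} (std {scores.std():.3f})")
```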
