
**Name:** Clothing Size Predictor

**Author:** Sharome Burton

**Date:** 07/18/2021

**Description:** Machine learning model used to predict women's clothing sizes based on historical data on age, weight and height

## 1. Problem definition
> How well can we predict the appropriate clothing size for an individual, given age, weight and height?

## 2. Data
The data file for this project, `final_test.csv`, can be downloaded from the clothing-size prediction dataset on Kaggle: https://www.kaggle.com/tourist55/clothessizeprediction

   
## 3. Evaluation 

> **Goal:** Predict the clothing size of an individual with >95% accuracy

## 4. Features

* weight (kg)
* age (years)
* height (cm)

### Import libraries

In [None]:
# Regular EDA (exploratory data analysis) and plotting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

# Models from Scikit-learn
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Model Evaluations
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
from sklearn.metrics import confusion_matrix, classification_report

### Import data

In [None]:
df_raw = pd.read_csv("../input/clothessizeprediction/final_test.csv")
df_raw

In [None]:
df_raw.info()

### Exploratory data analysis (EDA)

In [None]:
df_raw.describe()

In [None]:
# Number of occurrences for each size (target variable)
df_raw["size"].value_counts()

In [None]:
# Plot the number of occurrences for each size (target variable)
sns.countplot(x=df_raw["size"])

Size `M` is the most common.

In [None]:
# Age distribution
sns.displot(df_raw["age"])

A large fraction of the population appears to be between `25 and 35 years old`.

In [None]:
# Weight distribution
sns.displot(df_raw["weight"])

In [None]:
# Height distribution
sns.displot(df_raw["height"])

Population weight and height both appear to be approximately normally distributed.

### Removing outliers (z-score)

In [None]:
# Removing outliers: z-score each measurement within its own size group
dfs = []
sizes = []
for size_type in df_raw['size'].unique():
    sizes.append(size_type)
    ndf = df_raw[['age','height','weight']][df_raw['size'] == size_type]
    zscore = ((ndf - ndf.mean())/ndf.std())
    dfs.append(zscore)

# Values more than 3 standard deviations from the group mean become NaN
for i in range(len(dfs)):
    dfs[i]['age'] = dfs[i]['age'][(dfs[i]['age']>-3) & (dfs[i]['age']<3)]
    dfs[i]['height'] = dfs[i]['height'][(dfs[i]['height']>-3) & (dfs[i]['height']<3)]
    dfs[i]['weight'] = dfs[i]['weight'][(dfs[i]['weight']>-3) & (dfs[i]['weight']<3)]

# Re-attach the size label to each group
for i in range(len(sizes)):
    dfs[i]['size'] = sizes[i]
# Note: df_raw now holds the per-size z-scores, not the original units
df_raw = pd.concat(dfs)
df_raw.head()
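
The cell above leaves standardized values (z-scores) in `df_raw`. If the original units should be kept, one alternative is to use the z-scores only as a row filter. The following is a minimal sketch on a made-up sample, not the notebook's method; the `toy` frame and its values are hypothetical, and only the column names follow the dataset.

```python
import pandas as pd

# Hypothetical sample: one size group with a single extreme weight
toy = pd.DataFrame({
    "size": ["M"] * 12,
    "age": [25, 30, 35, 28, 32, 27, 31, 29, 33, 26, 34, 30],
    "height": [160, 165, 170, 162, 168, 164, 166, 163, 167, 161, 169, 165],
    "weight": [60, 61, 59, 60, 62, 58, 60, 61, 59, 60, 61, 200],
})

cols = ["age", "height", "weight"]

# z-score each measurement within its own size group
z = toy.groupby("size")[cols].transform(lambda s: (s - s.mean()) / s.std())

# Keep a row only if every measurement is within 3 standard deviations;
# missing values are left in place so they can be imputed later
keep = (z.abs() < 3) | z.isna()
filtered = toy[keep.all(axis=1)]
```

Here `filtered` still contains weights in kilograms, and the row with the extreme weight is dropped rather than turned into a missing value.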

### Filling missing data

In [None]:
# Check for missing values
df_raw.isna().sum()

In [None]:
# Filling missing data
df_raw["age"] = df_raw["age"].fillna(df_raw['age'].median())
df_raw["height"] = df_raw["height"].fillna(df_raw['height'].median())
df_raw["weight"] = df_raw["weight"].fillna(df_raw['weight'].median())

In [None]:
# Mapping clothes size from strings to numeric
df_raw['size'] = df_raw['size'].map({"XXS": 1,
                                     "S": 2,
                                     "M" : 3,
                                     "L" : 4,
                                     "XL" : 5,
                                     "XXL" : 6,
                                     "XXXL" : 7})

In [None]:
# Check for missing values
df_raw.isna().sum()

In [None]:
df_raw

### Feature Engineering
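We will create two new features to help model training:
* `bmi` - named after body-mass index, a common medical screening measure for obesity; note that it is computed here as the ratio `height / weight`, not the standard BMI formula (weight in kg divided by the square of height in m)
* `weight-squared` - grows quadratically with `weight`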

In [None]:
df_raw["bmi"] = df_raw["height"]/df_raw["weight"]
df_raw["weight-squared"] = df_raw["weight"] * df_raw["weight"]

In [None]:
df_raw

### Correlation matrix

In [None]:
corr = sns.heatmap(df_raw.corr(), annot=True)

Clothing `size` seems much more strongly correlated with `weight` than with `age` or `height`, and seems to have a strong inverse correlation with `bmi`.
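
To read the same information without a heatmap, the target's column of the correlation matrix can be sorted by absolute value. A minimal sketch on made-up numbers; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical sample in which size tracks weight exactly and age only loosely
toy = pd.DataFrame({
    "size": [1, 2, 3, 4, 5],
    "weight": [50, 60, 70, 80, 90],
    "age": [30, 25, 40, 22, 35],
})

# Correlation of every feature with the target, strongest first
corr_with_size = (
    toy.corr()["size"]
    .drop("size")
    .sort_values(key=abs, ascending=False)
)
```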

### Splitting data into training and validation datasets
The target variable is clothing `size`, and we will let the validation set be 10% of the total population.

In [None]:
# Features
X = df_raw.drop("size", axis=1)

# Target
y = df_raw["size"]

In [None]:
X.head()

In [None]:
y.head()

In [None]:
# Splitting data into training set and validation set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10)

In [None]:
len(X_train), len(X_test)
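
The split above is random on every run. If repeatable results and equal class proportions in both sets are wanted, `train_test_split` accepts `random_state` and `stratify`. The sketch below uses synthetic data; the seed value and the `X_toy` / `y_toy` names are assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic features and four equally sized classes, for illustration only
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({
    "weight": rng.normal(65, 10, 200),
    "height": rng.normal(165, 7, 200),
})
y_toy = pd.Series(np.repeat([1, 2, 3, 4], 50))

# 10% validation set, same class proportions in both parts, fixed seed
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.10, stratify=y_toy, random_state=18
)
```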

### Training Model
We will try:
* Logistic Regression
* K-Nearest Neighbors
* Random Forest Classifier
* Decision Tree Classifier

In [None]:
# Put models in a dictionary
models = {"Logistic Regression": LogisticRegression(),
         "KNN": KNeighborsClassifier(),
         "Random Forest": RandomForestClassifier(),
         "Decision Tree": DecisionTreeClassifier()}

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models: a dict of different scikit-learn machine learning models
    X_train: training data (no labels)
    X_test: testing data (no labels)
    y_train: training labels
    y_test: test labels
    Returns a dict mapping each model name to its accuracy on the test data.
    """
    # Set random seed
    np.random.seed(18)
    # Make a dictionary to keep model scores
    model_scores = {}
    # Loop through models
    for name, model in models.items():
        # Fit model to data
        model.fit(X_train, y_train)
        # Evaluate model and append its score to model_scores
        model_scores[name] = model.score(X_test, y_test)

    return model_scores

In [None]:
# model_scores = fit_and_score(models,X_train,X_test,y_train,y_test)

# model_scores

In [None]:
# model_compare = pd.DataFrame(model_scores, index=["accuracy"])
# model_compare.T.plot.bar();

### Model evaluation
We will continue with the `DecisionTreeClassifier` model, which scored highest in initial tests with `99.9749%` accuracy.
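
A score from a single 90/10 split can depend on which rows land in the validation set. One way to check how stable it is would be k-fold cross-validation with `cross_val_score`, which is already imported. The following is a sketch on synthetic data, where the size code is a made-up function of weight; it does not reproduce the figure quoted above.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the hypothetical size code depends only on the weight band
rng = np.random.default_rng(18)
weight = rng.uniform(40, 120, 300)
height = rng.uniform(150, 190, 300)
X_toy = np.column_stack([weight, height])
y_toy = np.digitize(weight, [60, 80, 100]) + 1

# Accuracy on each of 5 folds, then its mean and spread
scores = cross_val_score(DecisionTreeClassifier(random_state=18), X_toy, y_toy, cv=5)
mean_acc, std_acc = scores.mean(), scores.std()
```

On the real data, `X` and `y` from the split section would take the place of `X_toy` and `y_toy`.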


In [None]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)


In [None]:
# Confusion matrix
print(confusion_matrix(y_test, y_pred))

In [None]:
# Classification report
print(classification_report(y_test, y_pred))
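
The confusion matrix and the report are linked: overall accuracy is the diagonal sum divided by the total, and per-class recall is each diagonal entry divided by its row sum. A small worked sketch with made-up labels:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted size codes
y_true = [1, 1, 2, 2, 3, 3]
y_hat = [1, 1, 2, 3, 3, 3]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)

# Correct predictions sit on the diagonal
accuracy = np.trace(cm) / cm.sum()
recall_per_class = np.diag(cm) / cm.sum(axis=1)
```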

### Conclusion

The trained model scores about `99.9%` on the validation set (weighted average in the classification report), so the evaluation goal of >95% accuracy has been met.

### Feature Importance

In [None]:
# Find feature importance of ideal model
len(model.feature_importances_)

In [None]:
model.feature_importances_

In [None]:
# Helper function for plotting the n most important features
def plot_features(columns, importances, n=20):
    df = (pd.DataFrame({"features": columns,
                       "feature_importances": importances})
         .sort_values("feature_importances", ascending=False)
         .reset_index(drop=True))
    # Plot the top n rows of the dataframe
    fig, ax = plt.subplots()
    ax.barh(df["features"][:n], df["feature_importances"][:n])
    ax.set_ylabel("Features")
    ax.set_xlabel("Feature Importance")
    ax.invert_yaxis()
    
plot_features(X_train.columns, model.feature_importances_)

`weight` seems to be an extremely significant determinant for the model relative to the other features.
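
The tree's built-in `feature_importances_` are computed from the training data. As a cross-check, permutation importance measures how much held-out accuracy drops when one feature is shuffled. The sketch below uses synthetic data with one informative feature and one noise feature; the names and values are hypothetical.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the size code depends on weight; the second column is noise
rng = np.random.default_rng(18)
weight = rng.uniform(40, 120, 400)
noise = rng.normal(size=400)
X_toy = np.column_stack([weight, noise])
y_toy = np.digitize(weight, [60, 80, 100]) + 1

# Fit on the first 300 rows, evaluate on the remaining 100
tree = DecisionTreeClassifier(random_state=18).fit(X_toy[:300], y_toy[:300])

# Mean drop in accuracy when each feature is shuffled 10 times
result = permutation_importance(
    tree, X_toy[300:], y_toy[300:], n_repeats=10, random_state=18
)
importances = dict(zip(["weight", "noise"], result.importances_mean))
```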