# Tabular Kaggle Project

By Friday, March 8, you must submit your first attempt at the Kaggle Project in the form of a notebook. The notebook needs to cover the following:


## Define Project

* Provide Project link.
* Short paragraph describing the challenge. 
* Briefly describe the data.
* What type of Machine Learning? Supervised Classification (binary or multiclass) or Regression? 


## Project Definition

Steel Plate Defect Prediction
* https://www.kaggle.com/competitions/playground-series-s4e3/overview

* The challenge is to "predict the probability of each of the 7 binary targets," according to the competition page. Because each of the 7 defect types is its own binary target, this is a supervised multi-label classification problem (7 binary targets), not a single 7-way multiclass problem.

* The competition provides a synthetic training dataset of 19219 rows and 34 columns (not counting `id`): 27 features and 7 binary targets. Both the train and test datasets were generated using a deep learning model based on the Steel Plates Faults dataset from UCI. Feature distributions of the synthetic datasets have been slightly altered from the original.

## Data Loading and Initial Look

* Load the data. 
* Count the number of rows (data points) and features.
* Any missing values? 

* Make a table, where each row is a feature or collection of features:
    * Is the feature categorical or numerical
    * What values? 
        * e.g. for categorical: "0,1,2"
        * e.g. for numerical specify the range
    * How many missing values
    * Do you see any outliers?
        * Define outlier.
        
* For classification is there class imbalance?
* What is the target:
    * Classification: how is the target encoded (e.g. 0 and 1)?
    * Regression: what is the range?

In [5]:
# Load the data

import numpy as np
import pandas as pd

data = pd.read_csv('../../../playground-series-s4e3/train.csv', index_col='id')
data.head()

| id | X_Minimum | X_Maximum | Y_Minimum | Y_Maximum | Pixels_Areas | X_Perimeter | Y_Perimeter | Sum_of_Luminosity | Minimum_of_Luminosity | Maximum_of_Luminosity | ... | Orientation_Index | Luminosity_Index | SigmoidOfAreas | Pastry | Z_Scratch | K_Scatch | Stains | Dirtiness | Bumps | Other_Faults |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 584 | 590 | 909972 | 909977 | 16 | 8 | 5 | 2274 | 113 | 140 | ... | -0.5 | -0.0104 | 0.1417 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 808 | 816 | 728350 | 728372 | 433 | 20 | 54 | 44478 | 70 | 111 | ... | 0.7419 | -0.2997 | 0.9491 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 39 | 192 | 2212076 | 2212144 | 11388 | 705 | 420 | 1311391 | 29 | 141 | ... | -0.0105 | -0.0944 | 1.0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 3 | 781 | 789 | 3353146 | 3353173 | 210 | 16 | 29 | 3202 | 114 | 134 | ... | 0.6667 | -0.0402 | 0.4025 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | 1540 | 1560 | 618457 | 618502 | 521 | 72 | 67 | 48231 | 82 | 111 | ... | 0.9158 | -0.2455 | 0.9998 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |


In [3]:
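# Count number of rows and features

# The 7 defect columns are targets; every other column is a feature.
targets = ['Pastry', 'Z_Scratch', 'K_Scatch', 'Stains', 'Dirtiness', 'Bumps', 'Other_Faults']

num_rows, num_col = data.shape

print("Number of rows: ", num_rows)
print("Number of columns: ", num_col)
print("Number of features: ", num_col - len(targets))

Number of rows:  19219
Number of columns:  34
Number of features:  27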


In [7]:
# Check for missing values

missing_values = data.isnull().any()

# Print columns with missing values, if any
if missing_values.any():
    print("Columns with missing values:")
    print(missing_values[missing_values])
else:
    print("No missing values in the dataset.",end="\n\n")
    
data.info()

No missing values in the dataset.

<class 'pandas.core.frame.DataFrame'>
Int64Index: 19219 entries, 0 to 19218
Data columns (total 34 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   X_Minimum              19219 non-null  int64  
 1   X_Maximum              19219 non-null  int64  
 2   Y_Minimum              19219 non-null  int64  
 3   Y_Maximum              19219 non-null  int64  
 4   Pixels_Areas           19219 non-null  int64  
 5   X_Perimeter            19219 non-null  int64  
 6   Y_Perimeter            19219 non-null  int64  
 7   Sum_of_Luminosity      19219 non-null  int64  
 8   Minimum_of_Luminosity  19219 non-null  int64  
 9   Maximum_of_Luminosity  19219 non-null  int64  
 10  Length_of_Conveyer     19219 non-null  int64  
 11  TypeOfSteel_A300       19219 non-null  int64  
 12  TypeOfSteel_A400       19219 non-null  int64  
 13  Steel_Plate_Thickness  19219 non-null  int64  
 14  Edges_Index        

In [40]:
# Table of column info

feature_info = []

# Detect outliers function
def idoutliers(column, dataset=data):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] (the IQR rule)."""
    Q1 = dataset[column].quantile(0.25)
    Q3 = dataset[column].quantile(0.75)

    IQR = Q3 - Q1

    lwr_bnd = Q1 - 1.5*IQR
    upr_bnd = Q3 + 1.5*IQR

    # Boolean mask of values that fall outside the bounds
    is_outlier = (dataset[column] < lwr_bnd) | (dataset[column] > upr_bnd)

    return int(is_outlier.sum())

# Iterate through each column of the dataset
for column in data.columns:
    # Determine feature type 
    if set(data[column].unique()) == {0,1}:
        feature_type = 'Cat'
        feature_values = [0,1]
        outliers = 0
    else:
        feature_type = 'Num'
        feature_values = f"{data[column].min()}-{data[column].max()}"
        outliers = idoutliers(column)

    # Count missing values
    missing_values = data[column].isnull().sum()
    # Append feature information to the list
    feature_info.append([column, feature_type, feature_values, missing_values, outliers])
# Create a DataFrame from the list of feature information
feature_table = pd.DataFrame(feature_info, columns=['Feature', 'Type', 'Vals', 'Miss Vals', 'Outliers'])

print(feature_table)

                  Feature Type            Vals  Miss Vals  Outliers
0               X_Minimum  Num          0-1705          0         0
1               X_Maximum  Num          4-1713          0         0
2               Y_Minimum  Num   6712-12987661          0      1118
3               Y_Maximum  Num   6724-12987692          0      1112
4            Pixels_Areas  Num        6-152655          0      3722
5             X_Perimeter  Num          2-7553          0      3717
6             Y_Perimeter  Num           1-903          0      2785
7       Sum_of_Luminosity  Num    250-11591414          0      3826
8   Minimum_of_Luminosity  Num           0-196          0       211
9   Maximum_of_Luminosity  Num          39-253          0      1292
10     Length_of_Conveyer  Num       1227-1794          0         0
11       TypeOfSteel_A300  Cat          [0, 1]          0         0
12       TypeOfSteel_A400  Cat          [0, 1]          0         0
13  Steel_Plate_Thickness  Num          40-300  
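
The class-imbalance and target-encoding questions listed at the top of this section can be answered directly from the seven target columns. The following is a minimal sketch that uses a small made-up frame in place of `train.csv` (the toy values are illustrative assumptions, not competition data). The column names are copied from the `data.head()` output, and `target_summary` is a hypothetical helper name.

```python
import pandas as pd

# Target column names as they appear in the data.head() output
# (including the "K_Scatch" spelling used in the file).
TARGETS = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
           "Dirtiness", "Bumps", "Other_Faults"]

# Toy stand-in for the real training frame (values are made up).
toy = pd.DataFrame(
    [
        [0, 0, 0, 1, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0, 0, 0],
    ],
    columns=TARGETS,
)


def target_summary(df, targets=TARGETS):
    """Return the number and fraction of positive rows for each target."""
    counts = df[targets].sum()
    return pd.DataFrame({"positives": counts, "fraction": counts / len(df)})


summary = target_summary(toy)
print(summary)

# Number of positive labels per row. If every row had exactly one positive,
# the targets would behave like a one-hot encoded multiclass label; rows
# with 0 or more than 1 positives indicate a genuinely multi-label target.
labels_per_row = toy[TARGETS].sum(axis=1)
print(labels_per_row.value_counts().sort_index())
```

On the real data, replacing `toy` with `data` gives the per-defect positive rate, which is the quantity to report for class imbalance.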

## Data Visualization

* For classification: compare the histogram of every feature between the classes. There are many examples of this from class.
* For regression: 
    * Define 2 or more classes based on the value of the regression target.
        * For example, if the regression target is between 0 and 1:
            * 0.0-0.25: Class 1
            * 0.25-0.5: Class 2
            * 0.5-0.75: Class 3
            * 0.75-1.0: Class 4
    * Compare histograms of the features between the classes.
        
* Note that for categorical features, the information in the histogram can often be better presented in a table.
* Comment on which features look most promising for the ML task.
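
The per-class histogram comparison described above might look like the following sketch. It uses made-up data: `Feature_A` is a hypothetical numerical feature, the values of `Stains` and `TypeOfSteel_A300` are randomly generated, and `plot_feature_by_class` is an illustrative helper name. For the binary feature, a normalized cross-tabulation stands in for the histogram.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Toy stand-in: one numerical feature, one binary feature, one binary target.
n = 500
target = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "Feature_A": rng.normal(loc=target * 1.5, scale=1.0),
    "TypeOfSteel_A300": rng.integers(0, 2, size=n),
    "Stains": target,
})


def plot_feature_by_class(df, feature, target, bins=30, ax=None):
    """Overlay normalized histograms of `feature` for target == 0 and == 1."""
    if ax is None:
        _, ax = plt.subplots()
    # Shared bin edges so the two histograms are directly comparable.
    edges = np.histogram_bin_edges(df[feature], bins=bins)
    for value in (0, 1):
        ax.hist(df.loc[df[target] == value, feature], bins=edges,
                density=True, alpha=0.5, label=f"{target} = {value}")
    ax.set_xlabel(feature)
    ax.set_ylabel("density")
    ax.legend()
    return ax


ax = plot_feature_by_class(toy, "Feature_A", "Stains")

# For a categorical (here binary) feature, a table is often clearer:
# each row shows the fraction of target values within one feature value.
table = pd.crosstab(toy["TypeOfSteel_A300"], toy["Stains"], normalize="index")
print(table)
```

Using `density=True` matters when the classes are imbalanced, since raw counts would make the rarer class nearly invisible. With seven binary targets, one option is to repeat the plot once per target.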

## Data Cleaning and Preparation for Machine Learning

* Perform any data cleaning. Be clear about what you are doing and for which feature. 
* Determine whether rescaling is important for your Machine Learning model.
    * If so, select a strategy for each feature.
    * Apply rescaling.
* Visualize the features before and after cleaning and rescaling.
* One-hot encode your categorical features.
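
One possible rescaling strategy is sketched below: a `log1p` transform for a heavy-tailed feature, followed by standardization of all numerical features. The data is made up, and the choice of `Pixels_Areas` as the heavy-tailed example is an assumption based on the large outlier count reported for it in the feature table. Which features need which transform on the real data has to be decided from their histograms.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-in for two numerical columns (made-up values):
# one heavy-tailed, one roughly symmetric.
toy = pd.DataFrame({
    "Pixels_Areas": rng.lognormal(mean=5.0, sigma=1.5, size=1000),
    "Length_of_Conveyer": rng.normal(loc=1500.0, scale=100.0, size=1000),
})

# Step 1: compress the long right tail with log(1 + x).
# This is only valid for non-negative features.
transformed = toy.copy()
transformed["Pixels_Areas"] = np.log1p(transformed["Pixels_Areas"])

# Step 2: standardize every column to zero mean and unit variance.
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(transformed),
    columns=transformed.columns,
    index=transformed.index,
)

# Compare summary statistics before and after.
stats = ["mean", "std", "min", "max"]
print(toy.describe().loc[stats])
print(scaled.describe().loc[stats])
print("skew before:", toy["Pixels_Areas"].skew())
print("skew after: ", scaled["Pixels_Areas"].skew())
```

In a full pipeline the scaler would be fit on the training split only and then applied to the validation and test splits, so that no information leaks from held-out data. Tree-based models are largely insensitive to this kind of rescaling, while linear models and neural networks usually benefit from it.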

## Machine Learning


### Problem Formulation

* Remove unneeded columns, for example:
    * duplicated columns.
    * categorical features that were replaced by one-hot encoded columns.
    * features that identify specific rows, like an ID number.
* Make sure your target is also properly encoded.
* Split training sample into train, validation, and test sub-samples.
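
A three-way split can be built from two calls to scikit-learn's `train_test_split`. The sketch below assumes a 60/20/20 split and uses made-up features `f0`..`f3` and two of the target names as a stand-in for the real data; the proportions and the `random_state` are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy stand-in: 1000 rows, 4 features, 2 binary targets (made-up values).
X = pd.DataFrame(rng.normal(size=(1000, 4)),
                 columns=[f"f{i}" for i in range(4)])
y = pd.DataFrame(rng.integers(0, 2, size=(1000, 2)),
                 columns=["Stains", "Bumps"])

# First hold out 20% of all rows as the test sub-sample.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Then take 25% of the remaining 80% as validation (0.25 * 0.80 = 0.20).
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200
```

The test sub-sample here is carved out of the labeled training file; it is separate from the unlabeled Kaggle test set used for the submission.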

### Train ML Algorithm

* You only need one algorithm for now. You can do more if you like.
* For now, focus on making it work rather than on getting the best result.
* Try to get a non-trivial result.
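
As one way to get a first working model for several binary targets, the sketch below fits one logistic regression per target using scikit-learn's `OneVsRestClassifier`, and compares validation accuracy with a majority-class baseline to check that the result is non-trivial. The data is synthetic and the model choice is an assumption for illustration; any classifier with `predict_proba` could be substituted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy multi-label problem: 3 features, 2 binary targets (made-up data).
X = rng.normal(size=(600, 3))
Y = np.column_stack([
    (X[:, 0] + 0.5 * rng.normal(size=600) > 0).astype(int),
    (X[:, 1] - X[:, 2] + 0.5 * rng.normal(size=600) > 1).astype(int),
])
X_train, X_val = X[:450], X[450:]
Y_train, Y_val = Y[:450], Y[450:]

# One independent logistic regression per target column.
model = make_pipeline(
    StandardScaler(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(X_train, Y_train)

# For multi-label targets, predict_proba returns one column of
# positive-class probabilities per target.
proba_val = model.predict_proba(X_val)
print(proba_val.shape)  # (150, 2)

# Non-trivial check: per-target accuracy vs. always predicting the majority class.
accuracy = ((proba_val > 0.5).astype(int) == Y_val).mean(axis=0)
pos_rate = Y_val.mean(axis=0)
baseline = np.maximum(pos_rate, 1 - pos_rate)
print("model accuracy:   ", accuracy)
print("majority baseline:", baseline)
```

Accuracy can look good on imbalanced targets even for a useless model, which is why the comparison with the majority baseline is printed alongside it.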

### Evaluate Performance on Validation Sample

* Compute the usual metric for your ML task.
* Compute the score for the Kaggle challenge.
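
For a problem with several binary targets and predicted probabilities, a common score is the ROC AUC computed separately for each target and then averaged. The sketch below assumes that this is how the challenge is scored; the metric is not stated in this notebook, so it should be confirmed on the competition's evaluation page. `mean_columnwise_auc` is a hypothetical helper name and the arrays are made up.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def mean_columnwise_auc(y_true, y_score):
    """Return (mean AUC, list of per-target AUCs) for 2-D label/score arrays."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    aucs = [roc_auc_score(y_true[:, j], y_score[:, j])
            for j in range(y_true.shape[1])]
    return float(np.mean(aucs)), aucs


# Tiny made-up example: 4 rows, 2 binary targets.
y_true = np.array([[0, 1],
                   [0, 0],
                   [1, 1],
                   [1, 0]])
y_score = np.array([[0.10, 0.9],
                    [0.40, 0.2],
                    [0.35, 0.8],
                    [0.80, 0.3]])

mean_auc, per_target = mean_columnwise_auc(y_true, y_score)
print(per_target)  # [0.75, 1.0]
print(mean_auc)    # 0.875
```

Reporting the per-target values next to the mean shows which defect types the model handles poorly. Each target column in the validation sample must contain both classes, otherwise `roc_auc_score` raises an error for that column.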

### Apply ML to the challenge test set
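
A sketch of how the predictions for the challenge test set could be written out. It assumes the submission file has an `id` column followed by one probability column per target, named as in the training data; this layout should be checked against the competition's sample submission. The ids and probabilities below are placeholders, not model output.

```python
import io

import numpy as np
import pandas as pd

# Target column names as they appear in the training data.
TARGETS = ["Pastry", "Z_Scratch", "K_Scatch", "Stains",
           "Dirtiness", "Bumps", "Other_Faults"]

# Hypothetical stand-ins: in the real notebook, test_ids would be the index
# of test.csv and proba_test the output of model.predict_proba(X_test).
test_ids = pd.Index([19219, 19220, 19221], name="id")
proba_test = np.full((len(test_ids), len(TARGETS)), 0.5)

# One row per test id, one probability column per target.
submission = pd.DataFrame(proba_test, index=test_ids, columns=TARGETS)

# Written to an in-memory buffer here; with a real file path this would be
# submission.to_csv("submission.csv").
buffer = io.StringIO()
submission.to_csv(buffer)
print(buffer.getvalue())
```

The test features must go through exactly the same cleaning and rescaling steps as the training features, using the transforms fitted on the training sample, before `predict_proba` is called.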

