## DATATHON CHALLENGE

### _Predicting Energ_Kcal using a regression model_


The model predicts Energ_Kcal from the top 30 numerical predictors in the dataframe. Several categorical columns were dropped because the information needed to calculate energy is already provided in the form of the ingredients. Columns with more than 40% missing values were removed from the dataset, and the remaining missing values were imputed with a KNN imputer. Outliers were not removed because we lacked the domain knowledge to judge them. Most features are extremely right-skewed. Because the variables are on different scales, the data was normalized with MinMaxScaler(). Several regression models were fit on the training data and compared by cross-validated MAE. Although the linear regression model gave the lowest MAE, the XGB regression model was selected as the final model because it is robust to the outliers that we have not removed from the dataset.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

data= pd.read_excel(r"C:\Users\saras\Downloads\ABBREV.xlsx\ABBREV.xlsx")
print("Data size: ",data.shape)
data.head()

In [None]:
#checking for the datatype and missing value details
print(data.info())
data.isnull().sum()



In [None]:
#Working on a real copy so that the original dataframe stays untouched
data_copy=data.copy()

#Dropping columns with more than 40% missing values
data_copy= data_copy.drop(data_copy.columns[data_copy.isnull().mean()>0.4],axis=1)
data_copy.info()
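
The 40% rule above relies on `isnull().mean()` returning the fraction of missing values per column. A minimal sketch on a made-up frame (the column names and values here are hypothetical, not taken from ABBREV.xlsx) shows which columns would survive:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "b" is 60% missing, "c" is 20% missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [np.nan, np.nan, np.nan, 1.0, 2.0],
    "c": [1.0, np.nan, 3.0, 4.0, 5.0],
})

# Fraction of NaN values in each column.
missing_share = toy.isnull().mean()

# Same threshold as the notebook cell, written as a keep-mask instead of a drop.
kept = toy.loc[:, missing_share <= 0.4]

print(missing_share.to_dict())
print(list(kept.columns))
```

Printing `missing_share` before dropping makes it easy to see how close each column sits to the threshold.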

In [None]:
#Dropping Shrt_Desc: we are interested in calculating the calories, and all the ingredient details
#needed for that are already given as numeric columns.
#GmWt_Desc1 is dropped for the same reason, since GmWt_1 already gives the weight of the product.
#NDB_No is dropped because it acts like an index and carries no predictive meaning.
data_copy=data_copy.drop(['Shrt_Desc','GmWt_Desc1','NDB_No'],axis=1)

In [None]:
#Imputing missing values using KNN imputer
from sklearn.impute import KNNImputer
X=data_copy.values
imputer=KNNImputer()
imputer.fit(X)
#Our y (Energ_Kcal) does not have any missing values, so the imputer leaves it unchanged
#even though it is still one of the columns of X at this point
X_imputed=imputer.transform(X)
X_imputed.shape
data_copy= pd.DataFrame(X_imputed, columns= data_copy.columns)
data_copy
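
The cell above fits the imputer on every row, including rows that later end up in the test set. One possible variation, shown here only as a sketch on synthetic data (the array `X_demo` and the name `demo_imputer` are hypothetical), is to split first and fit `KNNImputer` on the training rows alone:

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the real features (assumption for illustration).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 4))
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan  # blank out roughly 10% of entries

# Split before imputing so the test rows never influence the imputer.
X_tr, X_te = train_test_split(X_demo, test_size=0.333, random_state=42)

demo_imputer = KNNImputer(n_neighbors=5)      # 5 neighbours is the sklearn default
X_tr_imp = demo_imputer.fit_transform(X_tr)   # learn neighbours from training rows only
X_te_imp = demo_imputer.transform(X_te)       # reuse them for the test rows

print(np.isnan(X_tr_imp).sum(), np.isnan(X_te_imp).sum())
```

Whether this changes the scores noticeably on the real dataset is not something the notebook measures; the sketch only shows the mechanics.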

In [None]:
#We are building a model without removing the outliers
#Checking skewness and plotting histogram 
skew=data_copy.skew()
skew=skew.loc[lambda x:(x>5) | (x<-5)]
print("Skewness before transformation: ",skew)

data_copy.hist(figsize=(18,20))
plt.show()


In [None]:
#As the data is highly skewed and on different scales, we will normalize the data using MinMaxScaler().
#The data is split first so that the scaler can be fitted on the training rows only.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

X=data_copy.drop(['Energ_Kcal'],axis=1).values
y=data_copy['Energ_Kcal'].values
X_train,X_test,y_train,y_test= train_test_split(X,y,test_size=0.333,random_state=42)
y_test.shape

In [None]:
#Transforming the independent variables using MinMaxScaler(), as our variables are on different scales
scaler= MinMaxScaler()
scaler.fit(X_train)
X_train= scaler.transform(X_train)
X_test= scaler.transform(X_test)
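
Min-max scaling brings every feature into the same range, but as a linear rescaling it does not change skewness. The sketch below, on a synthetic right-skewed series (an assumption for illustration, not a column of the dataset), compares it with a `log1p` transform, which is one common way to reduce right skew:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for a nutrient column (assumption).
rng = np.random.default_rng(0)
s = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=5000))

# Manual min-max scaling: a linear map onto [0, 1].
minmax = (s - s.min()) / (s.max() - s.min())

# log1p compresses the long right tail and is defined at zero.
logged = np.log1p(s)

print(round(s.skew(), 3), round(minmax.skew(), 3), round(logged.skew(), 3))
```

The first two printed values match, while the third is smaller in magnitude. The notebook itself does not apply a log transform; this is only a possible extension.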

In [None]:
#Transforming the target variable using a separate MinMaxScaler(),
#so that the scaler fitted on the features is not overwritten
y_scaler= MinMaxScaler()
y_scaler.fit(y_train.reshape(-1,1))
y_train=y_scaler.transform(y_train.reshape(-1,1))
y_test= y_scaler.transform(y_test.reshape(-1,1))
y_test.shape

In [None]:
#Selecting the top 30 features for modelling with SelectKBest
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression

fs = SelectKBest(score_func=f_regression, k=30)
fs.fit(X_train, y_train.ravel())
X_train_fs = fs.transform(X_train)
X_test_fs = fs.transform(X_test)
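
Because the features were converted to a NumPy array before selection, the names of the 30 chosen columns are not visible. A minimal sketch, assuming a toy frame with hypothetical column names `f0` to `f4`, shows how `get_support()` can map the selection back to names:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Toy data (assumption): only f1 and f3 actually drive the target.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 5)), columns=["f0", "f1", "f2", "f3", "f4"])
target = 3.0 * demo["f1"] - 2.0 * demo["f3"] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(score_func=f_regression, k=2)
selector.fit(demo.values, target.values)

# get_support() returns a boolean mask aligned with the input columns.
chosen = demo.columns[selector.get_support()].tolist()
print(chosen)
```

In the notebook, the same mask from `fs.get_support()` could be applied to the column index of `data_copy.drop(['Energ_Kcal'],axis=1)`.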

In [None]:
%pip install xgboost

In [None]:
#Modelling with multiple algorithms to select the best

from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

models= [('LR', LinearRegression()), ('DTR',DecisionTreeRegressor()),('SVM',SVR()),('XG',XGBRegressor())]
kfold = KFold(n_splits=10, shuffle=True, random_state=42)
for name,model in models:
    cv_results= cross_val_score(model,X_train_fs,y_train.ravel(),cv=kfold,scoring='neg_mean_absolute_error')
    cv_results= np.absolute(cv_results)
    print("{}: Mean error:{},\t\tStd deviation of error:{}".format(name, cv_results.mean().round(4), cv_results.std().round(4)))
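
In the loop above, the scaler and the feature selector were fitted on the full training set before cross-validation, so each validation fold has already influenced them. One way to keep every fold clean, sketched here on synthetic data with `LinearRegression` as a stand-in model (the data and step names are assumptions), is to wrap the steps in a `Pipeline`:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic regression problem (assumption): two informative features out of eight.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(150, 8))
y_demo = 4.0 * X_demo[:, 0] + 9.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=150)

# Scaling and selection are refitted inside every training fold.
pipe = Pipeline([
    ("scale", MinMaxScaler()),
    ("select", SelectKBest(score_func=f_regression, k=3)),
    ("model", LinearRegression()),
])

cv = KFold(n_splits=10, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X_demo, y_demo, cv=cv, scoring="neg_mean_absolute_error")

# The scorer returns negated MAE, so flip the sign to get errors.
mae_per_fold = -scores
print(mae_per_fold.mean().round(4), mae_per_fold.std().round(4))
```

Any of the four estimators in `models` could take the place of the last pipeline step.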

In [None]:
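#Choosing XGB as the final model as it has a low MAE and is also robust to outliers

model= XGBRegressor()
model.fit(X_train_fs,y_train.ravel())
ypred= model.predict(X_test_fs)
#Note: the target was min-max scaled, so this MSE is in scaled units, not kcal
print("Test MSE (scaled target):",mean_squared_error(y_test.ravel(),ypred))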
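
Since the target was min-max scaled, the error printed above is in scaled units. A sketch of converting predictions back to kcal with `inverse_transform` follows; the calorie values, the perturbations used as "predictions", and the name `target_scaler` are made up for illustration:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import MinMaxScaler

# Made-up calorie values standing in for the real target (assumption).
y_true_kcal = np.array([50.0, 120.0, 300.0, 900.0])
target_scaler = MinMaxScaler().fit(y_true_kcal.reshape(-1, 1))

# Scaled targets, plus small offsets that play the role of model predictions.
y_true_scaled = target_scaler.transform(y_true_kcal.reshape(-1, 1)).ravel()
y_pred_scaled = y_true_scaled + np.array([0.01, -0.02, 0.0, 0.03])

# Map the predictions back to the original kcal scale.
y_pred_kcal = target_scaler.inverse_transform(y_pred_scaled.reshape(-1, 1)).ravel()

mae_scaled = mean_absolute_error(y_true_scaled, y_pred_scaled)
mae_kcal = mean_absolute_error(y_true_kcal, y_pred_kcal)
print(mae_scaled, mae_kcal)
```

For a min-max scaled target, the MAE in kcal equals the scaled MAE multiplied by the range of the values the scaler was fitted on (here 900 - 50 = 850).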