## **DMBI Project**
----------------------------------------------------------

### Gym Crowd Size Prediction

*Given data about a campus gym, we predict the number of people who will be at the gym at a given time. We compare three regression models: Ridge linear regression, a neural network (MLPRegressor), and a random forest.*

Project by Shloka Bhatt (191310132028) and Bhavya Thakkar (191310132022)

In [35]:
#For working with data
import numpy as np
import pandas as pd

#For Preprocessing
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

#Models to be used
from sklearn.linear_model import Ridge
from sklearn.neural_network import MLPRegressor
from sklearn.ensemble import RandomForestRegressor

# Suppress library warnings (e.g. convergence warnings) to keep the output clean
import warnings
warnings.filterwarnings(action='ignore')

In [36]:
data = pd.read_csv('../input/crowdedness-at-the-campus-gym/data.csv')

In [37]:
data

In [38]:
data.info()

## **Preprocessing**

In [39]:
def preprocess_inputs(df):
    """Extract date features, split into train/test sets, and scale the inputs."""
    df = df.copy()
    
    # Extract Date Features
    df['date'] = pd.to_datetime(df['date'])
    df['month'] = df['date'].apply(lambda x: x.month)
    df['day'] = df['date'].apply(lambda x: x.day)
    df['hour'] = df['date'].apply(lambda x: x.hour)
    df['minute'] = df['date'].apply(lambda x: x.minute)
    df = df.drop('date', axis=1)
    
    # Split df into X and y
    y = df['number_people']
    X = df.drop('number_people', axis=1)
    
    # Train-test split (70% train, 30% test)
    X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=1)
    
    # Scale X, fitting the scaler on the training split only
    scaler = StandardScaler()
    scaler.fit(X_train)
    X_train = pd.DataFrame(scaler.transform(X_train), index=X_train.index, columns=X_train.columns)
    X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index, columns=X_test.columns)
    
    return X_train, X_test, y_train, y_test

In [40]:
X_train, X_test, y_train, y_test = preprocess_inputs(data)
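
As a sanity check on the scaling step, the sketch below repeats the same split-then-scale pattern on a small synthetic frame. The column names and values are made up for illustration and are not taken from the gym dataset. It shows that fitting `StandardScaler` on the training split gives training columns with mean 0 and unit variance, while the test split, transformed with the training statistics, is only approximately standardized.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the gym data; values are illustrative only
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "hour": rng.integers(0, 24, size=200),
    "temperature": rng.normal(60, 10, size=200),
})

toy_train, toy_test = train_test_split(toy, train_size=0.7, shuffle=True, random_state=1)

# Fit on the training rows only, then reuse those statistics for the test rows
toy_scaler = StandardScaler().fit(toy_train)
toy_train_scaled = pd.DataFrame(toy_scaler.transform(toy_train), index=toy_train.index, columns=toy_train.columns)
toy_test_scaled = pd.DataFrame(toy_scaler.transform(toy_test), index=toy_test.index, columns=toy_test.columns)

print(toy_train_scaled.mean().round(6))       # approximately 0 for every column
print(toy_train_scaled.std(ddof=0).round(6))  # approximately 1 for every column
print(toy_test_scaled.mean().round(3))        # generally near 0, not exactly 0
```

Keeping the test rows out of `fit` means the reported test metrics are not influenced by test-set statistics.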

In [41]:
X_train

In [42]:
y_train

## **Training**

In [43]:
models = {
    "Linear Regression (Ridge)": Ridge(),
    "           Neural Network": MLPRegressor(),
    "            Random Forest": RandomForestRegressor()
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name + " trained.")

## **Results**

In [45]:
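def get_rmse(y_test, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    return rmse

def get_r2(y_test, y_pred):
    """R^2: 1 minus the ratio of the model's squared error to that of predicting the mean of y_test."""
    r2 = 1 - (np.sum((y_test - y_pred)**2) / np.sum((y_test - y_test.mean())**2))
    return r2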

In [48]:
for name, model in models.items():
    y_pred = model.predict(X_test)
    rmse = get_rmse(y_test, y_pred)
    print(name + " RMSE: {:.2f}".format(rmse))

*The Random Forest achieves an RMSE of about 6.6, i.e., its predictions are typically off by roughly 6.6 people, which is fairly good.
The other two models have noticeably larger errors.
Hence the Random Forest outperforms the other models on this test set.*
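
Because RMSE squares each error before averaging, a few large misses raise it more than they would raise a plain average of absolute errors. The sketch below illustrates this on made-up counts (not the notebook's results), using the same formula as `get_rmse` and cross-checking it against scikit-learn's `mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up counts, for illustration only
y_true = np.array([10, 20, 30, 40])
y_hat = np.array([12, 18, 30, 50])

# Same formula as get_rmse above
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))
# Cross-check: square root of scikit-learn's mean squared error
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
# Mean absolute error, for comparison
mae = mean_absolute_error(y_true, y_hat)

print(f"RMSE: {rmse_manual:.3f}")  # 5.196
print(f"MAE:  {mae:.3f}")          # 3.500
```

The single miss of 10 people pulls the RMSE well above the MAE, so RMSE is better read as a typical error that penalizes large misses than as a literal average.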

In [49]:
for name, model in models.items():
    y_pred = model.predict(X_test)
    r2 = get_r2(y_test, y_pred)
    print(name + " R^2: {:.5f}".format(r2))

*The Random Forest achieves an R^2 of about 0.91, i.e., its squared error on the test set is about 91% lower than that of a baseline model that always predicts the mean number of people.*
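
To make the baseline comparison concrete, the sketch below computes R^2 by hand on made-up counts (not the notebook's results), checks it against scikit-learn's `r2_score`, and confirms that a baseline which always predicts the mean scores exactly 0.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up counts, for illustration only
y_true = np.array([10, 20, 30, 40])
y_hat = np.array([12, 18, 30, 50])

# Baseline: always predict the mean of the observed values
baseline = np.full(y_true.shape, y_true.mean())

ss_res = np.sum((y_true - y_hat) ** 2)          # squared error of the model: 108
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of the baseline: 500
r2_manual = 1 - ss_res / ss_tot

print(f"R^2 (manual):   {r2_manual:.3f}")                    # 0.784
print(f"R^2 (sklearn):  {r2_score(y_true, y_hat):.3f}")      # 0.784
print(f"R^2 (baseline): {r2_score(y_true, baseline):.3f}")   # 0.000
```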