# Restaurant revenue prediction
*****
Data fields

**Id**: Restaurant ID.

**Open Date**: Opening date of the restaurant.

**City**: City where the restaurant is located. Note that the names contain Unicode characters.

**City Group**: Type of city: Big Cities or Other.

**Type**: Type of restaurant. FC: Food Court, IL: Inline, DT: Drive Thru, MB: Mobile.

**P1, P2 - P37**: These obfuscated data fall into three categories. Demographic data are gathered from third-party providers with GIS systems; they include the population in a given area, the age and gender distribution, and development scales. Real-estate data mainly relate to the floor area (m2) of the location, its front facade, and car park availability. Commercial data mainly cover the existence of points of interest, including schools, banks, and other QSR operators.

**Revenue**: The revenue column gives the (transformed) revenue of the restaurant in a given year and is the target of the predictive analysis. Note that the values are transformed, so they do not correspond to actual dollar amounts.

# 1.1-Exploratory analysis and visualization

1-a Detecting missing values

In [None]:
import pandas as pd
import numpy as np
import math
from sklearn.model_selection import train_test_split,KFold
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
import seaborn as sns
from sklearn.linear_model import (LinearRegression, Ridge, Lasso,LogisticRegression)
from sklearn.ensemble import RandomForestRegressor
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

In [None]:
df_train = pd.read_csv('../input/restaurant-revenue-prediction/train.csv.zip')
df_test = pd.read_csv('../input/restaurant-revenue-prediction/test.csv.zip')
df_train.head()

In [None]:
df_test.head()

In [None]:
df_train.columns

In [None]:
df_train.isnull().sum().max()

No missing values.
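
`isnull().sum().max()` only reports the worst column. As a minimal sketch, on a small made-up frame (the column values below are illustrative and not taken from the dataset), a per-column report can be easier to read when some values are missing:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df_train; values are made up for illustration.
toy = pd.DataFrame({
    "City": ["İstanbul", None, "Ankara", "İzmir"],
    "P1": [4.0, 2.0, np.nan, np.nan],
    "revenue": [1.0, 2.0, 3.0, 4.0],
})

# Count and share of missing values per column, worst column first.
missing = toy.isnull().sum().sort_values(ascending=False)
report = pd.DataFrame({"n_missing": missing, "share": missing / len(toy)})
print(report[report["n_missing"] > 0])
```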

In [None]:
print(df_train['revenue'].describe())
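# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(df_train['revenue'], kde=True, stat='density').set(xlabel='revenue', ylabel='P(revenue)')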

In [None]:
df_train[df_train['revenue'] > 10000000 ]

In [None]:
# Remove extreme values and rebuild a clean 0..n-1 index
df_train = df_train[df_train['revenue'] < 10000000].reset_index(drop=True)
df_train.head()

1-b Data transformation (discrete data, ...)

In [None]:
# Split the features into numerical and categorical columns
numerical_features = df_train.select_dtypes([np.number]).columns.tolist()
categorical_features = df_train.select_dtypes(exclude = [np.number,np.datetime64]).columns.tolist()
print(categorical_features)
print(numerical_features)

Converting Open Date into Age, an integer value

4-a Judging from the visualization, does a restaurant's opening date affect the final prediction?

In [None]:
from datetime import date, datetime

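def calculate_age(born):
    """Return the age in whole years from an 'mm/dd/YYYY' date string."""
    born = datetime.strptime(born, "%m/%d/%Y").date()
    today = date.today()
    # Subtract one year if this year's anniversary has not been reached yet
    return today.year - born.year - ((today.month, today.day) < (born.month, born.day))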

df_train['Age'] = df_train['Open Date'].apply(calculate_age)
df_test['Age'] = df_test['Open Date'].apply(calculate_age)

# Drop 'Open Date' column from Dataframes
df_train = df_train.drop('Open Date', axis=1)
df_test = df_test.drop('Open Date', axis=1)

# The 'Id' column is not dropped here; it remains in both DataFrames

df_train.head()

The visualization shows that a restaurant's age and its revenue are correlated as long as the age is below 11 years; beyond 11 years they are no longer correlated.

In [None]:
# Median revenue for each restaurant age
result = df_train.groupby("Age")["revenue"].median().reset_index()

# One shade of red per bar, darker for a higher median revenue
norm = plt.Normalize(result["revenue"].values.min(), result["revenue"].values.max())
colors = plt.cm.Reds(norm(result["revenue"].values))

plt.figure(figsize=(12,8))
sns.barplot(x="Age", y="revenue", data=result, palette=colors)
plt.ylabel('revenue', fontsize=12)
plt.xlabel('Age', fontsize=12)
plt.xticks(rotation='vertical')
plt.show()
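
The observation about the 11-year threshold can also be checked numerically. The sketch below is only an illustration: `age_revenue_corr` is a hypothetical helper, and the data are synthetic (built so that revenue rises with age up to 11 years and is pure noise afterwards), not the competition data.

```python
import numpy as np
import pandas as pd

def age_revenue_corr(df, threshold=11):
    """Pearson correlation between Age and revenue below and at/above a threshold."""
    young = df[df["Age"] < threshold]
    old = df[df["Age"] >= threshold]
    return young["Age"].corr(young["revenue"]), old["Age"].corr(old["revenue"])

# Synthetic data: revenue grows with age before 11 years, then stays flat with noise.
rng = np.random.default_rng(0)
age = rng.integers(1, 25, size=500)
revenue = np.where(age < 11, 1e5 * age, 1.1e6) + rng.normal(0, 5e4, size=500)
toy = pd.DataFrame({"Age": age, "revenue": revenue})

corr_young, corr_old = age_revenue_corr(toy)
print(f"Age < 11: {corr_young:.2f}, Age >= 11: {corr_old:.2f}")
```

On the real data, the same helper could be called on `df_train` once the `Age` column exists.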

2-a The city with the largest number of restaurants (ISTANBUL)

2-c The most common restaurant type in the dataset

The city type whose restaurants generate the most revenue (Big Cities)

5- The restaurant type generating the most revenue, with justification (FC: Food Court)

In [None]:
categorical_features = df_train.select_dtypes(exclude = [np.number,np.datetime64]).columns.tolist()
fig, ax = plt.subplots(3, 1, figsize=(40, 30))
for variable, subplot in zip(categorical_features, ax.flatten()):
    df_2 = df_train[[variable,'revenue']].groupby(variable).revenue.sum().reset_index()
    df_2.columns = [variable,'total_revenue']
    sns.barplot(x=variable, y='total_revenue', data=df_2 , ax=subplot)
    subplot.set_xlabel(variable,fontsize=20)
    subplot.set_ylabel('Total Revenue',fontsize=20)
    for label in subplot.get_xticklabels():
        label.set_rotation(45)
        label.set_size(20)
    for label in subplot.get_yticklabels():
        label.set_size(20)
fig.tight_layout()

2-b The features most correlated with the target

In [None]:
n = len(df_train[numerical_features].columns)
w = 3
h = (n - 1) // w + 1
fig, axes = plt.subplots(h, w, figsize=(w * 6, h * 3))
for i, (name, col) in enumerate(df_train[numerical_features].items()):
    r, c = i // w, i % w
    ax = axes[r, c]
    col.hist(ax=ax)
    ax2 = col.plot.kde(ax=ax, secondary_y=True, title=name)
    ax2.set_ylim(0)

fig.tight_layout()

In [None]:
from sklearn import preprocessing

# Label-encode Age and the categorical columns, one fresh encoder per column
for column in ['Age', 'City', 'City Group', 'Type']:
    df_train[column] = preprocessing.LabelEncoder().fit_transform(df_train[column])

df_train.dtypes==object

In [None]:
df_train.info()

In [None]:
from sklearn import preprocessing

# Same encoding for the test set. Caveat: encoders fitted on the test set alone
# may assign integer codes that differ from those used for the training set.
for column in ['Age', 'City', 'City Group', 'Type']:
    df_test[column] = preprocessing.LabelEncoder().fit_transform(df_test[column])
df_test.info()
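
The cells above fit separate encoders on the training and test sets, so the same city can receive a different code in each. A minimal sketch of one alternative, assuming scikit-learn's `OrdinalEncoder` is acceptable here (it is not what the notebook uses), fits on the training frame only and maps categories unseen in training to -1. The toy frames are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy train/test frames; the values are made up for illustration.
toy_train = pd.DataFrame({"City": ["Ankara", "İstanbul", "İzmir"], "Type": ["FC", "IL", "DT"]})
toy_test = pd.DataFrame({"City": ["İzmir", "Bursa"], "Type": ["IL", "MB"]})

# Fit on train only; categories unseen in train are encoded as -1 in test.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(toy_train)
test_codes = encoder.transform(toy_test)
print(test_codes)
```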

In [None]:
df_train.head()

In [None]:
df_test.head()

3- Visualizing the 5 clusters with K-means

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

In [None]:
X = StandardScaler().fit_transform(df_train)
kmeans = KMeans(n_clusters=5)
model = kmeans.fit(X)
print("model\n", model)

In [None]:
centers = model.cluster_centers_
centers

4-b Visualizing the approach

In [None]:
from sklearn.cluster import DBSCAN
dbscan=DBSCAN(eps=3, min_samples=20)
dbscan.fit(X)

In [None]:
labels=dbscan.labels_
# Outliers are the points labelled -1
df_train[labels==-1]

In [None]:
df_train[labels==-1].shape

In [None]:
# The target
y= df_train['revenue']

4-a Using nearest neighbors to choose eps

In [None]:
from sklearn.neighbors import NearestNeighbors
from matplotlib import pyplot as plt

neighbors = NearestNeighbors(n_neighbors=5)
neighbors_fit = neighbors.fit(df_train)
distances, indices = neighbors_fit.kneighbors(df_train)
distances = np.sort(distances, axis=0)
distances = distances[:,1]
plt.plot(distances)

The ideal value of eps equals the distance at the "crook of the elbow", that is, the point of maximum curvature. This point marks where diminishing returns are no longer worth the additional cost. The concept of diminishing returns applies here because, although increasing the number of clusters always improves the model fit, it also increases the risk of overfitting.
In this case, the value of epsilon is eps = 18.
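
The elbow can also be estimated programmatically instead of read off the plot. The sketch below is one simple heuristic, not the method used in this notebook: it sorts each point's distance to its k-th neighbor and picks the point farthest below the straight line joining the two ends of that curve. The helper names `k_distances` and `elbow_value`, the choice of k, and the synthetic data are all assumptions made for illustration. In practice the distances would be computed on the same matrix that is passed to DBSCAN.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def k_distances(X, k=5):
    """Sorted distance from each point to its k-th nearest neighbor (itself excluded)."""
    # k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    distances, _ = nn.kneighbors(X)
    return np.sort(distances[:, k])

def elbow_value(sorted_dist):
    """Distance at the point farthest below the chord joining the curve's endpoints."""
    n = len(sorted_dist)
    x = np.arange(n) / (n - 1)
    y = (sorted_dist - sorted_dist[0]) / (sorted_dist[-1] - sorted_dist[0])
    return sorted_dist[np.argmax(x - y)]

# Synthetic data: two tight blobs plus a few scattered outliers.
rng = np.random.default_rng(0)
blobs = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
outliers = rng.uniform(-10, 15, (10, 2))
X_toy = np.vstack([blobs, outliers])

dist = k_distances(X_toy, k=5)
eps_guess = elbow_value(dist)
print(f"suggested eps: {eps_guess:.2f}")
```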

5- Visualization using t-SNE (t-distributed Stochastic Neighbor Embedding)

In [None]:
# Import TSNE
from sklearn.manifold import TSNE
model = TSNE(learning_rate=10)
tsne_features = model.fit_transform(X)
xs = tsne_features[:,0]
ys = tsne_features[:,1]
plt.scatter(xs,ys, c=y)
plt.show()
plt.clf()

# 1.2- Feature Engineering

1- Visualizing the correlation of the features with the target

In [None]:
fig, ax = plt.subplots(13, 3, figsize=(30, 35))
for variable, subplot in zip(numerical_features, ax.flatten()):
    sns.regplot(x=df_train[variable], y=df_train['revenue'], ax=subplot)
    subplot.set_xlabel(variable,fontsize=20)
    subplot.set_ylabel('Revenue',fontsize=20)
fig.tight_layout()

In [None]:
fig = plt.figure(figsize=(20,16))
target_corr = df_train[df_train.columns[1:]].corr()['revenue']
order_corr = target_corr.sort_values()
y = pd.DataFrame(order_corr).index[:-1]
x = pd.DataFrame(order_corr).revenue[:-1]
# Recent seaborn versions require keyword arguments for x and y
sns.barplot(x=x, y=y, orient='h')
plt.show()

In [None]:
corr_positive= target_corr[target_corr>0]
corr_negative= target_corr[target_corr<0]
corr_positive


2- Visualizing the feature correlation matrix

In [None]:
plt.figure(figsize=(45,25))
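corr_matrix = df_train.corr()
# Mask the upper triangle so each pair is shown only once (np.bool was removed from NumPy; use bool)
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, annot=True, mask=mask)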
sns.set(font_scale=1.4)

According to the linear regression plots and the heatmap of correlation scores, these numerical attributes have a very weak linear relationship with the target variable "revenue". The highest correlation score is that of "Age", at 0.2, while the others have a correlation score close to 0. However, there are groups of attributes that are strongly correlated with one another.
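
Groups of mutually correlated attributes can be listed directly rather than read off the heatmap. The following is a minimal sketch: `correlated_pairs` is a hypothetical helper, the 0.8 threshold is an arbitrary assumption, and the toy columns are synthetic.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return (feature_a, feature_b, corr) for pairs with |corr| >= threshold."""
    corr = df.corr()
    # Keep the strict upper triangle so each pair appears once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()
    pairs = pairs[pairs.abs() >= threshold]
    return [(a, b, float(v)) for (a, b), v in pairs.items()]

# Synthetic columns: P2 is nearly a copy of P1, P3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "P1": base,
    "P2": base + rng.normal(scale=0.1, size=200),
    "P3": rng.normal(size=200),
})
pairs = correlated_pairs(toy)
print(pairs)
```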

3-a **Displaying the mean importance of each feature using the coefficients obtained from embedded methods**


**Displaying the mean importance of each feature using the coefficients obtained from wrapper methods**

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
df_train['revenue'] = np.log1p(df_train['revenue'])
X, y = df_train.drop('revenue', axis=1), df_train['revenue']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=150)

# Note: the 'normalize' option was removed from Ridge in scikit-learn 1.2,
# so it is no longer part of the grid.
params_ridge = {
    'alpha' : [.01, .1, .5, .7, .9, .95, .99, 1, 5, 10, 20],
    'fit_intercept' : [True, False],
    'solver' : ['svd', 'cholesky', 'lsqr', 'sparse_cg', 'sag', 'saga']
}

ridge_model = Ridge()
ridge_regressor = GridSearchCV(ridge_model, params_ridge, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
ridge_regressor.fit(X_train, y_train)
print(f'Optimal alpha: {ridge_regressor.best_params_["alpha"]:.2f}')
print(f'Optimal fit_intercept: {ridge_regressor.best_params_["fit_intercept"]}')
print(f'Optimal solver: {ridge_regressor.best_params_["solver"]}')
print(f'Best score: {ridge_regressor.best_score_}')

In [None]:
# Refit Ridge with the best hyperparameters found by the grid search
ridge_model = Ridge(**ridge_regressor.best_params_)
ridge_model.fit(X_train, y_train)
y_train_pred = ridge_model.predict(X_train)
y_pred = ridge_model.predict(X_test)
# r2_score is not symmetric: the true values come first
print('Train r2 score: ', r2_score(y_train, y_train_pred))
print('Test r2 score: ', r2_score(y_test, y_pred))
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Train RMSE: {train_rmse:.4f}')
print(f'Test RMSE: {test_rmse:.4f}')

3-b Visualizing the result

In [None]:
# Ridge Model Feature Importance
ridge_feature_coef = pd.Series(index = X_train.columns, data = np.abs(ridge_model.coef_))
ridge_feature_coef.sort_values().plot(kind = 'bar', figsize = (13,5));

In [None]:
# Note: the 'normalize' option was removed from Lasso in scikit-learn 1.2,
# so it is no longer part of the grid.
params_lasso = {
    'alpha' : [.01, .1, .5, .7, .9, .95, .99, 1, 5, 10, 20],
    'fit_intercept' : [True, False],
}

lasso_model = Lasso()
lasso_regressor = GridSearchCV(lasso_model, params_lasso, scoring='neg_root_mean_squared_error', cv=5, n_jobs=-1)
lasso_regressor.fit(X_train, y_train)
print(f'Optimal alpha: {lasso_regressor.best_params_["alpha"]:.2f}')
print(f'Optimal fit_intercept: {lasso_regressor.best_params_["fit_intercept"]}')
print(f'Best score: {lasso_regressor.best_score_}')

In [None]:
# Refit Lasso with the best hyperparameters found by the grid search
lasso_model = Lasso(**lasso_regressor.best_params_)
lasso_model.fit(X_train, y_train)
y_train_pred = lasso_model.predict(X_train)
y_pred = lasso_model.predict(X_test)
# r2_score is not symmetric: the true values come first
print('Train r2 score: ', r2_score(y_train, y_train_pred))
print('Test r2 score: ', r2_score(y_test, y_pred))
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Train RMSE: {train_rmse:.4f}')
print(f'Test RMSE: {test_rmse:.4f}')

In [None]:
# Lasso Model Feature Importance
lasso_feature_coef = pd.Series(index = X_train.columns, data = np.abs(lasso_model.coef_))
lasso_feature_coef.sort_values().plot(kind = 'bar', figsize = (13,5));
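
The section heading mentions the mean importance of each feature, while the cells above plot Ridge and Lasso coefficients separately. One possible way to average them is sketched below. `mean_importance` is a hypothetical helper; standardizing the features and normalizing each model's absolute coefficients to sum to 1 are assumptions of this sketch, and the data are synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

def mean_importance(X, y, models):
    """Average each feature's share of the absolute coefficients across linear models."""
    X_scaled = StandardScaler().fit_transform(X)  # common scale so coefficients are comparable
    shares = []
    for model in models:
        coef = np.abs(model.fit(X_scaled, y).coef_)
        total = coef.sum()
        shares.append(coef / total if total > 0 else coef)
    return pd.Series(np.mean(shares, axis=0), index=X.columns).sort_values(ascending=False)

# Synthetic data: P1 matters most, P2 a little, P3 not at all.
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["P1", "P2", "P3"])
toy_y = 3.0 * toy_X["P1"] + 0.5 * toy_X["P2"] + rng.normal(scale=0.1, size=200)

importance = mean_importance(toy_X, toy_y, [Ridge(alpha=1.0), Lasso(alpha=0.05)])
print(importance)
```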

In [None]:
from sklearn.neighbors import KNeighborsRegressor

params_knn = {
    'n_neighbors' : [3, 5, 7, 9, 11],
}

knn_model = KNeighborsRegressor()
knn_regressor = GridSearchCV(knn_model, params_knn, scoring='neg_root_mean_squared_error', cv=10, n_jobs=-1)
knn_regressor.fit(X_train, y_train)
print(f'Optimal neighbors: {knn_regressor.best_params_["n_neighbors"]}')
print(f'Best score: {knn_regressor.best_score_}')

In [None]:
knn_model = KNeighborsRegressor(n_neighbors=knn_regressor.best_params_["n_neighbors"])
knn_model.fit(X_train, y_train)
y_train_pred = knn_model.predict(X_train)
y_pred = knn_model.predict(X_test)
# r2_score is not symmetric: the true values come first
print('Train r2 score: ', r2_score(y_train, y_train_pred))
print('Test r2 score: ', r2_score(y_test, y_pred))
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f'Train RMSE: {train_rmse:.4f}')
print(f'Test RMSE: {test_rmse:.4f}')

# 1.3 Model training and hyperparameter tuning

In [None]:
y = df_train['revenue']
df_train=df_train.drop('revenue', axis=1)
df_train

In [None]:
print("Shapes: Train set ", df_train.shape ,", Test ",df_test.shape)
df_full = pd.concat([df_train,df_test])
print("Full dataset shapes: ", df_full.shape)

In [None]:
print('There are {} cities in which restaurant locations have been collected.'.format(df_full['City'].nunique()))

In [None]:
p_name = ['P'+str(i) for i in range(1,38)]
df_full

In [None]:
from sklearn.decomposition import PCA
pca = PCA().fit(df_full[p_name])
plt.figure(figsize=(7,5))
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of Components')
plt.ylabel('Explained variance')
plt.yticks(np.arange(0.1,1.1,0.05))
plt.xticks(np.arange(0,41,2))
plt.grid(True)

In [None]:
pca_list = ['pca'+str(i) for i in range(1,30,1)]
df_full[pca_list] = PCA(n_components=29).fit_transform(df_full[p_name])
df_full.drop(p_name,axis=1,inplace=True)
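
The number of components can also be derived from the cumulative explained variance rather than read off the curve. The sketch below uses a hypothetical helper `n_components_for`, an assumed target of 95% of the variance, and synthetic data driven by two latent factors.

```python
import numpy as np
from sklearn.decomposition import PCA

def n_components_for(X, target=0.95):
    """Smallest number of components whose cumulative explained variance reaches target."""
    cumulative = np.cumsum(PCA().fit(X).explained_variance_ratio_)
    # searchsorted returns the first index where cumulative >= target
    return min(int(np.searchsorted(cumulative, target)) + 1, len(cumulative))

# Synthetic data: 10 observed features driven by 2 latent factors plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
X_toy = latent @ rng.normal(size=(2, 10)) + rng.normal(scale=0.05, size=(300, 10))

k = n_components_for(X_toy, target=0.95)
print(k)
```

scikit-learn also accepts a variance fraction directly, as in `PCA(n_components=0.95)`.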


In [None]:
df=pd.get_dummies(df_full, dtype=float)

In [None]:
# Get number of train sets
numTrain=df_train.shape[0]

train = df[:numTrain]
test = df[numTrain:]

In [None]:
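# distplot is deprecated in recent seaborn releases; histplot with a KDE overlay is the replacement
sns.histplot(y, kde=True, stat='density').set(xlabel='revenue', ylabel='P(revenue)')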


In [None]:
from sklearn.model_selection import train_test_split

# Split the data into train and test set
X_train, X_test, y_train, y_test =  train_test_split(df_train,y,test_size=0.33,random_state=50)
print("Shapes: ", X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import AdaBoostRegressor
from xgboost import XGBRegressor
best_estimators=[]
y

1- Logistic regression

2- Voting

In [None]:
from sklearn.ensemble import VotingRegressor
r1 = LinearRegression()
randomforest = RandomForestRegressor()

votingReg = VotingRegressor([('lr', r1), ('rf', randomforest)])
votingReg.fit(X_train, y_train)
y_pred = votingReg.predict(X_test)
RMSE_vo = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
best_estimators.append(["RMSE_voting",RMSE_vo])
RMSE_vo

3- Random forest

In [None]:
randomforest = RandomForestRegressor()
randomforest.fit(X_train, y_train)
y_pred = randomforest.predict(X_test)
RMSE_RF = math.sqrt(mean_squared_error(y_true = y_test, y_pred = y_pred))
print(RMSE_RF)
best_estimators.append(["RMSE_RandomForest",RMSE_RF])

4- AdaBoost

In [None]:
## parameters
params = {
    "n_estimators": [10, 30, 50, 100],
    "learning_rate": [.01, 0.1, 0.5, 0.9, 0.95, 1],
    "random_state" : [42]
}

## AdaBoost Regressor
AdaBoostR =   AdaBoostRegressor()
AdaBoostR_grid = GridSearchCV(AdaBoostR, params, scoring='r2', cv=5, n_jobs=-1)
AdaBoostR_grid.fit(X_train, y_train)

## Output
print("Best parameters:  {}:".format(AdaBoostR_grid.best_params_))
print("Best score: {}".format(AdaBoostR_grid.best_score_))

## Append to list (note: this is a cross-validated R2 score, not an RMSE like the other entries)
best_estimators.append(["R2_AdaBoost",AdaBoostR_grid.best_score_])

5- XGBoost

In [None]:
import xgboost

xgb_reg = xgboost.XGBRegressor()
xgb_reg.fit(X_train, y_train)
y_pred = xgb_reg.predict(X_test)
RMSE_XG=math.sqrt(mean_squared_error(y_test, y_pred))
best_estimators.append(["RMSE_XGBoost",RMSE_XG])
RMSE_XG

Comparing the performance of the algorithms

In [None]:
best_estimators
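
The list above mixes hold-out RMSE values with a cross-validated R2 score for AdaBoost, so its entries are not directly comparable. A minimal sketch of a like-for-like comparison, scoring every model with the same cross-validated RMSE, could look as follows. The synthetic dataset, the model settings, and the omission of XGBoost (to keep the example dependency-free) are assumptions of this sketch.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression problem standing in for the restaurant data.
X_toy, y_toy = make_regression(n_samples=150, n_features=8, noise=10.0, random_state=0)

candidates = {
    "LinearRegression": LinearRegression(),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "AdaBoost": AdaBoostRegressor(n_estimators=50, random_state=0),
}

# Same metric and same folds for every model: mean 5-fold RMSE (lower is better).
scores = {}
for name, model in candidates.items():
    neg_rmse = cross_val_score(model, X_toy, y_toy, cv=5, scoring="neg_root_mean_squared_error")
    scores[name] = -neg_rmse.mean()

for name, rmse in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.2f}")
```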

In [None]:
randomforest = RandomForestRegressor()
randomforest.fit(X_train, y_train)
y_pred = randomforest.predict(X_test)
# Store the result. Note: these are predictions for the hold-out split of the
# training data, on the log1p scale used for the target.
submission_df = pd.DataFrame({'Id': X_test.index, 'Prediction': y_pred})
submission_df.to_csv('Submission.csv', index=False)
submission_df
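
Because the target was transformed with `np.log1p` earlier in the notebook, model predictions are on the log scale. A minimal sketch of converting them back to revenue units with `np.expm1` before writing a submission is shown below. The ids, the revenues, and the stand-in predictions are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up revenues and ids, for illustration only.
revenue = np.array([1_500_000.0, 4_200_000.0, 7_900_000.0])
log_revenue = np.log1p(revenue)   # scale the model is trained on
pred_log = log_revenue + 0.01     # stand-in for model predictions on the log scale

submission = pd.DataFrame({
    "Id": [0, 1, 2],
    "Prediction": np.expm1(pred_log),  # undo log1p to return to revenue units
})
print(submission)
```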
