### EDA
We are given a large amount of data on customers' orders and restaurant locations. Let us see what insights we can get from it.

In [None]:
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
import seaborn as sns
import geopandas as gpd
import folium
from folium.plugins import HeatMap
import re

%matplotlib inline
plt.rcParams['figure.figsize'] = (12, 5);
sns.set_style('whitegrid')

First, we are given some variable explanations.

In [None]:
with open('../input/restaurant-recommendation-challenge/VariableDefinitions.txt') as f:
    print(f.read())

In [None]:
chunk_size=10000
root_dir = '../input/restaurant-recommendation-challenge'
train_full = pd.read_csv(os.path.join(root_dir, 'train_full.csv'))
orders = pd.read_csv(os.path.join(root_dir, 'orders.csv'))
vendors = pd.read_csv(os.path.join(root_dir, 'vendors.csv'))

The attached files are really large, so let's take just a sample of them for performance and basic analysis.

In [None]:
train_full = train_full.sample(chunk_size)
orders = orders.sample(chunk_size)
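
Sampling after `pd.read_csv` still loads the whole file into memory first. A minimal sketch of an alternative, assuming the CSV is too large to load comfortably, is to read it in chunks and keep a fixed fraction of each chunk. The helper name `sample_csv`, the default fraction and the seed are illustrative choices, not part of the original notebook.

```python
import pandas as pd


def sample_csv(path, frac=0.01, chunksize=100_000, seed=0):
    """Return a random sample of roughly `frac` of the rows in a CSV file.

    The file is read chunk by chunk, so only one chunk plus the
    accumulated sample is held in memory at any time.
    """
    parts = []
    for i, chunk in enumerate(pd.read_csv(path, chunksize=chunksize)):
        # Vary the seed per chunk so chunks are not sampled identically,
        # while keeping the overall result reproducible.
        parts.append(chunk.sample(frac=frac, random_state=seed + i))
    return pd.concat(parts, ignore_index=True)
```

With the paths used here, this could be called as `sample_csv(os.path.join(root_dir, 'train_full.csv'))`; the result size then scales with the file rather than being a fixed 10000 rows.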

In [None]:
train_full.head()

In [None]:
orders.head()

In [None]:
train_full.isna().sum().sum()

In [None]:
fig, ax = plt.subplots(1, 2)
train_full['gender'].hist(ax=ax[0], color='yellow')
train_full['location_type'].hist(ax=ax[1])

In [None]:
train_full['country_id'].value_counts()

Hm, let us see where our locations are anyway.

In [None]:
locs = gpd.read_file(os.path.join(root_dir, 'train_locations.csv'))
locs.dropna(subset=['latitude'], inplace=True)
locs.head()

In [None]:
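def check_num(string):
    # Coordinates are read as strings; parse the leading number,
    # keep its first 6 characters, and fall back to 0.0 otherwise.
    regex = r'-?[0-9]*\.?[0-9]+'
    m = re.match(regex, str(string))
    if m is None:
        return float(0)
    return float(m.group()[:6])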

locs['latitude'] = locs['latitude'].apply(check_num)
locs['longitude'] = locs['longitude'].apply(check_num)
locs['geometry'] = gpd.points_from_xy(locs['longitude'], locs['latitude'])

In [None]:
m = folium.Map(location=[50,-85], zoom_start=2)
for i in list(locs.index)[:50]:
    folium.Marker([locs.loc[i, 'latitude'], locs.loc[i, 'longitude']]).add_to(m)
m

Needless to say, there is something wrong with our location data. At least we confirmed the statement from the variable definitions.
> 'Latitude' and 'longitude': Not true latitude and longitude - locations have been masked, but nearby locations remain nearby in the new reference frame and can thus be used for clustering. However, not all locations are useful due to GPS errors and missing data - you may want to treat outliers separately.
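
The note above suggests treating outliers separately. One hedged way to do that, assuming the bulk of the masked coordinates forms a compact cloud, is an interquartile-range (IQR) fence on each coordinate. The function name `flag_coordinate_outliers` and the multiplier `k=1.5` are illustrative assumptions, and the toy values are made up.

```python
import pandas as pd


def flag_coordinate_outliers(df, cols=('latitude', 'longitude'), k=1.5):
    """Return a boolean Series that is True for rows lying outside the
    IQR fence (below Q1 - k*IQR or above Q3 + k*IQR) in any given column."""
    is_outlier = pd.Series(False, index=df.index)
    for col in cols:
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        is_outlier |= (df[col] < q1 - k * iqr) | (df[col] > q3 + k * iqr)
    return is_outlier


# Toy data (made up): five nearby points and one far-away GPS glitch.
toy = pd.DataFrame({
    'latitude': [0.1, 0.2, 0.15, 0.12, 0.18, 45.0],
    'longitude': [-78.5, -78.6, -78.4, -78.55, -78.45, 120.0],
})
mask = flag_coordinate_outliers(toy)
clean, outliers = toy[~mask], toy[mask]
```

Keeping the flagged rows in a separate frame, rather than dropping them, leaves room to inspect them later or to cluster only the `clean` part.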

In [None]:
sorted(train_full['location_number'].unique())

In [None]:
train_full[['status_x', 'status_y']].hist(color='magenta')

In [None]:
train_full['discount_percentage'].value_counts()

In [None]:
train_full['commission'].unique()

In [None]:
train_full['display_orders'].value_counts()

In [None]:
train_full['target'].sum()

In [None]:
train_full['rank'].hist()

In [None]:
train_full['prepration_time'].hist(color='gold')

Well... many of the features here are not really informative. Perhaps we can get better insights from orders and customers individually?

In [None]:
orders.head()

In [None]:
orders.describe()

In [None]:
plt.hist(orders['payment_mode']);

In [None]:
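# Correlate numeric columns only; recent pandas versions raise on text columns.
sns.heatmap(orders.select_dtypes('number').corr(), cmap="YlGnBu")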

In [None]:
fig, ax = plt.subplots(1, 2)
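# sns.distplot is deprecated; histplot with a KDE overlay is the current equivalent.
sns.histplot(orders['grand_total'].dropna(), kde=True, stat='density', ax=ax[0], color='purple')
sns.histplot(orders['item_count'].dropna(), kde=True, stat='density', ax=ax[1])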

In [None]:
fig, ax = plt.subplots(1, 2)
# Assign the column directly so it really becomes datetime64.
orders['delivery_date'] = pd.to_datetime(orders['delivery_date'])
ax[0].scatter(orders['delivery_date'], orders['item_count'],
              label='items', alpha=0.6, color='red')
ax[0].legend();
ax[1].scatter(orders['delivery_date'], orders['grand_total'],
              label='total pay', alpha=0.6, color='green')
ax[1].legend();

In [None]:
orders['delivery_time'] = pd.to_datetime(orders['delivery_time'], errors='coerce')
# Hour of day, extracted in one vectorised step (NaN where parsing failed).
orders['delivery_hour'] = orders['delivery_time'].dt.hour
orders['delivery_hour'].hist(bins=24, label='orders by hour of day')
plt.legend();

In [None]:
# Select the column before aggregating so non-numeric columns are never averaged.
orders.groupby('customer_id')['grand_total'].mean().plot(marker='.', linestyle='none', color='orange')
plt.title('total cost');

Huh, we even have people who appear to pay nothing... at least according to the given data. Let's dig some more.

In [None]:
orders[orders['grand_total']==0.0]

In [None]:
orders[orders['grand_total']==0.0]['promo_code'].isna().sum(), orders[orders['grand_total']==0.0].shape

Okay, so most of these people used a promo code. That's a bit of insight.
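
The two counts above are easier to read as a share. A small sketch, using made-up toy rows that only borrow the `grand_total` and `promo_code` column names from `orders`:

```python
import numpy as np
import pandas as pd

# Toy orders (made up) with the same column names as the real table.
toy_orders = pd.DataFrame({
    'grand_total': [0.0, 0.0, 0.0, 12.5, 7.0, 0.0],
    'promo_code': ['FREE', 'FREE10', np.nan, np.nan, 'SAVE', 'FREE'],
})

is_free = (toy_orders['grand_total'] == 0.0).rename('zero_total')
has_promo = toy_orders['promo_code'].notna().rename('has_promo')

# Fraction of zero-total orders that carry a promo code.
promo_share_free = has_promo[is_free].mean()

# Row-normalised contingency table: promo usage for free vs. paid orders.
table = pd.crosstab(is_free, has_promo, normalize='index')
```

Comparing the two rows of `table` shows whether promo codes are actually more common among zero-total orders than among paid ones, which the raw counts alone do not tell us.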

In [None]:
customers = pd.read_csv('../input/restaurant-recommendation-challenge/train_customers.csv')
customers.head()

In [None]:
customers['akeed_customer_id'].nunique(), customers.shape[0]

In [None]:
dists = ['gender', 'language', 'status', 'verified']
fig, axes = plt.subplots(2, 2)
# One histogram per column, walking the 2x2 grid in row-major order.
for col, ax in zip(dists, axes.flat):
    customers[col].dropna().hist(ax=ax, label=col, color='aqua')
    if col == 'gender':
        ax.tick_params(rotation=45)
    ax.legend()
plt.tight_layout();

Ouch, there is some mess going on in the gender column.

In [None]:
def clean_string(string):
    string = str(string)
    if '?' in string or string=='nan' or string.strip(' ')=='':
        return np.nan
    string = string.strip(' ').lower()
    return string

customers.loc[:, 'gender'] = customers['gender'].apply(clean_string)
customers['gender'].hist(color='chocolate')
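
The same clean-up can also be written with pandas string methods instead of a row-wise `apply`. This is a sketch on hypothetical raw values, not the actual contents of the column; note that `str.strip()` removes all surrounding whitespace, not only spaces.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values illustrating the kind of mess seen above.
raw = pd.Series(['Male', 'male  ', 'FEMALE', ' female', '?????', '', np.nan])

cleaned = raw.str.strip().str.lower()
# Treat empty strings and anything containing '?' as missing.
is_junk = cleaned.eq('') | cleaned.str.contains('?', regex=False, na=False)
cleaned = cleaned.mask(is_junk)

# Count the surviving categories, keeping missing values visible.
counts = cleaned.value_counts(dropna=False)
```

Inspecting `counts` before plotting is a quick way to confirm that only the expected categories remain.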

In [None]:
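def calc_age(year):
    # Missing or non-numeric birth years give NaN instead of raising.
    year = pd.to_numeric(year, errors='coerce')
    if pd.isna(year):
        return np.nan
    year = int(year)
    # Expand two-digit years: 0-9 -> 200x, 10-99 -> 19xx.
    if 0 <= year < 100:
        year += 2000 if year < 10 else 1900
    return 2020 - year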

customers.loc[:, 'age'] = customers['dob'].apply(calc_age)
customers[customers['age']<16]

Well... customers as young as 1 year old look truly suspicious, not to mention that the accounts of some of them seem to have been created in the same year as, or even earlier than, they were born... That's funny, but we can say with high confidence that these records are mistaken.
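
The consistency check described above can be made explicit by comparing the birth year with the year the account was created. A minimal sketch on made-up rows; `created_at` is an assumed name for the account creation column and may differ in the real table.

```python
import pandas as pd

# Toy customers (made up); `created_at` is an assumed column name.
toy = pd.DataFrame({
    'dob': [1990, 2019, 2021, None],
    'created_at': ['2018-05-01', '2019-03-10', '2020-01-15', '2019-07-07'],
})

created_year = pd.to_datetime(toy['created_at']).dt.year
# A birth year at or after the account creation year cannot be right.
# Rows with a missing birth year compare as False and are not flagged.
toy['dob_suspicious'] = toy['dob'] >= created_year
```

Flagging such rows, rather than silently filtering on age alone, keeps the reason for excluding them visible.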

In [None]:
# Keep plausible ages only; 16 is included to match the `< 16` check above.
ages = customers[(customers['age'] >= 16) & (customers['age'] < 110)]
ages['age'].dropna().hist(bins=20, label='customers by age', color='brown')

That concludes our quick look at the given data. Clearly, we need more effective tools to deal with such a high volume of data. That is what we will be doing in the <a href='https://www.kaggle.com/erelin6613/pyspark-alternating-least-squares-in-action?scriptVersionId=39511328'>next notebook</a>.