## SameSystem Data Science Interview Task

Your task is to use the data in the train.csv file to build a model (or several alternative models) to **predict shop visitor flows**.

Then build and test predictions for the month of 2021-06 using the test.csv file. You may use only the data in train.csv to train your model; no external data may be used.

The data set columns are the following:

- **sales** - total gross sales minus discounts per hour (sales at 10th hour means sales from 10am to 10:59am)
- **customers** - number of transactions per hour (this column appears as `number_of_customers` in train.csv)
- **visitors** - number of visitors per hour
- **opening_hour** - shop official opening hour that day
- **closing_hour** - shop official closing hour that day
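
The hour convention and the opening-hour columns described above can be turned into model features. The sketch below is one possible approach, not part of the task statement: the function name `add_time_features`, the `is_open` column, and the treatment of `closing_hour` as exclusive are assumptions. The sample rows are copied from the test.csv preview shown later in this notebook.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with calendar features and an opening-hours flag.

    Assumes df has an hourly DatetimeIndex plus the `opening_hour` and
    `closing_hour` columns described above.
    """
    out = df.copy()
    out["hour"] = out.index.hour
    out["day_of_week"] = out.index.dayofweek  # Monday = 0
    out["month"] = out.index.month
    # Assumption: the shop is open for hours h with opening_hour <= h < closing_hour.
    out["is_open"] = (out["hour"] >= out["opening_hour"]) & (
        out["hour"] < out["closing_hour"]
    )
    return out


sample = pd.DataFrame(
    {
        "visitors": [2, 15, 16],
        "opening_hour": [10, 10, 10],
        "closing_hour": [18, 18, 18],
    },
    index=pd.to_datetime(
        ["2021-06-01 09:00:00", "2021-06-01 10:00:00", "2021-06-01 11:00:00"]
    ),
)
features = add_time_features(sample)
print(features[["visitors", "hour", "day_of_week", "is_open"]])
```

In this sample the 09:00 row has visitors recorded before the official opening hour, which is the kind of observation such a flag can surface during exploratory analysis.
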

During the interview we will discuss the following questions:

- Do you have any noteworthy comments about the visitor data you are predicting?
- How good is your model?
- What is the prediction accuracy in test period?
- What are the shortcomings of your current model?
- How could you improve the model given more time?

Some initial code is provided below to help you get started. 

### Requirements

- Do an exploratory data analysis of the provided dataset.
- Build a data preprocessing pipeline, which can be used for your machine learning models.
- Build a machine learning model that predicts shop visitor flows (the `visitors` column).
- Use the above model(s) to predict shop visitor flows in the test.csv file. Export these predictions to a predictions.csv file.
- You should have a single Jupyter notebook, which shows all of your work. You should put helper functions, classes, etc. in separate Python modules and import them in this notebook. Submit this notebook with the output cells present.
- You should have Pickle (or similar) files for each of your trained models. There should be a clear and easy way to load these files to your Jupyter notebook and use them for predictions.
- Use Python 3 in this project. This requirement doesn't apply to libraries written in a different language.
- Don't use your name, GitHub alias, or other information that may identify you anywhere in the homework.
- Commit your solution to a private GitHub repository (they are free) and invite the `linas-samesystem` GitHub user as a collaborator.
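
One way the model, Pickle, and export requirements above could fit together is sketched below with a deliberately simple baseline: the mean number of visitors for each (day of week, hour) slot. This is an illustration under stated assumptions, not the expected solution. The function names, the file name `visitor_baseline.pkl`, the temporary working directory, and the toy training numbers are all hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd


def fit_hourly_baseline(train: pd.DataFrame) -> dict:
    """Learn the mean visitors per (day_of_week, hour) slot plus a global fallback."""
    slot_means = (
        train["visitors"].groupby([train.index.dayofweek, train.index.hour]).mean()
    )
    return {
        "slot_means": slot_means.to_dict(),
        "fallback": float(train["visitors"].mean()),
    }


def predict_hourly_baseline(model: dict, index: pd.DatetimeIndex) -> pd.Series:
    """Look up the slot mean for each timestamp; unseen slots get the fallback."""
    slots = zip(index.dayofweek, index.hour)
    values = [model["slot_means"].get(slot, model["fallback"]) for slot in slots]
    return pd.Series(values, index=index, name="visitors")


# Toy data (made up): two Tuesdays with observations at 10:00 and 11:00.
toy_train = pd.DataFrame(
    {"visitors": [10, 20, 14, 22]},
    index=pd.to_datetime(
        ["2021-05-18 10:00", "2021-05-18 11:00", "2021-05-25 10:00", "2021-05-25 11:00"]
    ),
)
test_index = pd.to_datetime(
    ["2021-06-01 10:00", "2021-06-01 11:00", "2021-06-01 12:00"]
)

workdir = Path(tempfile.mkdtemp())  # stand-in for the project directory
model_path = workdir / "visitor_baseline.pkl"

# Save the trained model, then load it back as the notebook would.
with model_path.open("wb") as fh:
    pickle.dump(fit_hourly_baseline(toy_train), fh)
with model_path.open("rb") as fh:
    loaded_model = pickle.load(fh)

predictions = predict_hourly_baseline(loaded_model, test_index)
csv_path = workdir / "predictions.csv"
predictions.to_csv(csv_path)
print(predictions)
```

Storing the baseline as a plain dictionary keeps the Pickle file loadable without importing any custom class. A model object from a library would need that library installed wherever the file is loaded.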

### Evaluation Criteria

The solution is evaluated based on these criteria (in order):
- The demonstration of your data analysis skills.
- The demonstration of your machine learning skills.
- Your knowledge of software engineering best practices.
- The RMSE of the visitor model measured on test.csv.
- The inference speed of your models.
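
Since RMSE is one of the criteria, a small helper for it may be useful. RMSE is the square root of the mean squared difference between actual and predicted values. In the sketch below the function name `rmse` is an assumption, the `actual` values are the five `visitors` values from the test.csv preview in this notebook, and the `predicted` values are made up for illustration.

```python
import numpy as np


def rmse(y_true, y_pred) -> float:
    """Root mean squared error between two equally shaped sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


actual = [2, 15, 16, 21, 18]     # visitors from the test.csv preview
predicted = [5, 15, 13, 21, 18]  # hypothetical model output
print(f"RMSE: {rmse(actual, predicted):.3f}")
```

Because the errors are squared before averaging, a few badly missed hours raise RMSE more than many small misses do.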

In [1]:
# Import relevant Data Science packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from matplotlib import pyplot as plt     # Plotting library
import seaborn as sns 

In [2]:
# Use these options for nicer display of results
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.float_format', lambda x: '%.0f' % x)
pd.set_option('display.max_columns', 25)
pd.set_option('display.max_rows', 50)

In [3]:
# Let's read in the data
# The first CSV column holds the hourly timestamps and is used as the index
train = pd.read_csv('input/train.csv', index_col=0)
test = pd.read_csv('input/test.csv', index_col=0)

In [6]:
train.head()

| Unnamed: 0 | sales | number_of_customers | visitors | opening_hour | closing_hour |
|---|---|---|---|---|---|
| 2017-01-02 11:00:00 | 1075 | 1 | 0 | 10 | 18 |
| 2017-01-02 12:00:00 | 1316 | 2 | 0 | 10 | 18 |
| 2017-01-02 13:00:00 | 2602 | 3 | 0 | 10 | 18 |
| 2017-01-02 14:00:00 | 518 | 2 | 0 | 10 | 18 |
| 2017-01-02 16:00:00 | 2395 | 1 | 0 | 10 | 18 |
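
The first rows of train.csv shown above jump from 14:00 to 16:00, so not every hour has a row. The sketch below is one possible way to list such gaps during exploratory analysis. It parses the index as timestamps with `parse_dates=True` (the starter code does not do this) and reads the preview rows from an in-memory string so that it runs on its own. Whether a missing hour means zero activity or a missing record is not stated in the task, so the code only reports the gap.

```python
import io

import pandas as pd

# The five preview rows of train.csv, with an unnamed first (index) column.
csv_text = """,sales,number_of_customers,visitors,opening_hour,closing_hour
2017-01-02 11:00:00,1075,1,0,10,18
2017-01-02 12:00:00,1316,2,0,10,18
2017-01-02 13:00:00,2602,3,0,10,18
2017-01-02 14:00:00,518,2,0,10,18
2017-01-02 16:00:00,2395,1,0,10,18
"""

preview = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# Every hour between the first and last timestamp, inclusive.
full_range = pd.date_range(
    preview.index.min(), preview.index.max(), freq=pd.Timedelta(hours=1)
)

# Hours in the full grid that have no row in the data.
missing_hours = full_range.difference(preview.index)
print(missing_hours)
```

On the real file, `io.StringIO(csv_text)` would be replaced by the path `'input/train.csv'` used in the cell above.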


In [7]:
test.head()

| Unnamed: 0 | visitors | opening_hour | closing_hour |
|---|---|---|---|
| 2021-06-01 09:00:00 | 2 | 10 | 18 |
| 2021-06-01 10:00:00 | 15 | 10 | 18 |
| 2021-06-01 11:00:00 | 16 | 10 | 18 |
| 2021-06-01 12:00:00 | 21 | 10 | 18 |
| 2021-06-01 13:00:00 | 18 | 10 | 18 |
