# **DATA DESCRIPTION**

You are provided with daily historical sales data. The task is to forecast the total number of products sold in every shop for the test set. Note that the list of shops and products changes slightly every month. Creating a robust model that can handle such situations is part of the challenge.
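
Because the list of shops and products drifts from month to month, it can help to measure how many test rows have no history in the training data. The following is a minimal sketch on made-up rows; the toy frames `train` and `test` and the helper `unseen_share` are hypothetical and are not part of the competition files.

```python
import pandas as pd

# Toy stand-ins for sales_train.csv and test.csv (values are made up).
train = pd.DataFrame({"shop_id": [1, 1, 2], "item_id": [10, 11, 10]})
test = pd.DataFrame({"shop_id": [1, 2, 2], "item_id": [10, 10, 12]})

def unseen_share(train, test, keys):
    """Fraction of test rows whose key combination never occurs in train."""
    seen = train[keys].drop_duplicates()
    # indicator=True adds a `_merge` column marking rows found only in test.
    merged = test.merge(seen, on=keys, how="left", indicator=True)
    return (merged["_merge"] == "left_only").mean()

pair_share = unseen_share(train, test, ["shop_id", "item_id"])
item_share = unseen_share(train, test, ["item_id"])
print(pair_share, item_share)
```

The same helper could be pointed at `df_sales_train` and `df_test` once they are loaded; a large unseen share would suggest that per-pair history alone is not enough for the forecast.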

## **File descriptions**

sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
sample_submission.csv - a sample submission file in the correct format.
items.csv - supplemental information about the items/products.
item_categories.csv - supplemental information about the item categories.
shops.csv - supplemental information about the shops.

## **Data fields**

ID - an Id that represents a (Shop, Item) tuple within the test set
shop_id - unique identifier of a shop
item_id - unique identifier of a product
item_category_id - unique identifier of item category
item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
item_price - current price of an item
date - date in format dd/mm/yyyy
date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
item_name - name of item
shop_name - name of shop
item_category_name - name of item category

This dataset is permitted to be used for any purpose, including commercial use.
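
The `date_block_num` convention described above (January 2013 is 0, October 2015 is 33) amounts to counting months since January 2013. A small sketch of that arithmetic follows; the function names are hypothetical, and the value for November 2015 is only what the same convention would imply, since the field list does not state it.

```python
def date_block_num(year, month):
    """Month index with January 2013 as block 0."""
    return (year - 2013) * 12 + (month - 1)

def block_to_year_month(block):
    """Inverse mapping: block index back to (year, month)."""
    years, month0 = divmod(block, 12)
    return 2013 + years, month0 + 1

print(date_block_num(2013, 1))    # January 2013
print(date_block_num(2015, 10))   # October 2015, last training month
print(date_block_num(2015, 11))   # November 2015 under the same convention
print(block_to_year_month(33))
```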

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

from scipy import stats

import warnings
warnings.filterwarnings("ignore")

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
print("Imported")

In [None]:
sns.set_theme()

In [None]:
# All competition files live in the same input directory.
DATA_DIR = "../input/competitive-data-science-predict-future-sales/"

df_item_categories = pd.read_csv(DATA_DIR + "item_categories.csv")
df_items = pd.read_csv(DATA_DIR + "items.csv")
df_sales_train = pd.read_csv(DATA_DIR + "sales_train.csv")
df_sample_submission = pd.read_csv(DATA_DIR + "sample_submission.csv")
df_shops = pd.read_csv(DATA_DIR + "shops.csv")
df_test = pd.read_csv(DATA_DIR + "test.csv")

In [None]:
df_test.set_index("ID").head()

In [None]:
df_sample_submission.set_index("ID").head()

In [None]:
df_sales_train.head()

In [None]:
df_sales_train.shape

In [None]:
df_sales_train.isnull().sum()

In [None]:
#df_sales_train.duplicated().sum()
df_sales_train.drop_duplicates(inplace=True)
df_sales_train.info()

In [None]:
#df_sales_train["date"] = df_sales_train['date'].astype('datetime64[ns]')
#df_sales_train["day"] = df_sales_train["date"].dt.day
#df_sales_train["month"] = df_sales_train["date"].dt.month
#df_sales_train["year"] = df_sales_train["date"].dt.year

In [None]:
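# Parse day-first dates, then keep only the year-month label (e.g. '2013-01').
# After this cell, 'date' is a string column, not a datetime.
df_sales_train['date'] = pd.to_datetime(df_sales_train['date'], dayfirst=True)
df_sales_train['date'] = df_sales_train['date'].dt.strftime('%Y-%m')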
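
The conversion above collapses each daily date to a year-month label. Here is a small sketch on two made-up dates written in the dd/mm/yyyy format given in the data description. Passing an explicit `format` is an assumption about the file (the real separator may differ) and removes any day/month ambiguity.

```python
import pandas as pd

raw = pd.Series(["02/01/2013", "15/10/2015"])  # made-up sample dates
parsed = pd.to_datetime(raw, format="%d/%m/%Y")  # day first, then month
labels = parsed.dt.strftime("%Y-%m")             # keep year and month only
print(labels.tolist())
```

Here "02/01/2013" is read as 2 January 2013, so it lands in the January label rather than February.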

In [None]:
df_sales_train.describe()

In [None]:
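# 'date' is now a string column, so restrict the correlation to numeric columns.
df_sales_train.select_dtypes("number").corr()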

In [None]:
list(df_sales_train.columns)

In [None]:
plt.figure(figsize=(25,7))
sns.lineplot(data=df_sales_train, x = "date", y = "item_cnt_day")
plt.title("ITEM_CNT_DAY w.r to DATE", fontsize = 20)
plt.ylabel("ITEM_CNT_DAY", fontsize = 15)
plt.xlabel("DATE", fontsize = 15)
plt.xticks(rotation=90)
plt.show()

In [None]:
# pairplot creates its own figure, so a preceding plt.figure() would only add an empty one.
sns.pairplot(data=df_sales_train)
plt.show()

In [None]:
len(df_sales_train["date"].unique())

In [None]:
#df_sales_train['item_cnt_day']=df_sales_train.groupby([['date','']]).transform(lambda x: +x)
df_sales_train.head()

In [None]:
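# Aggregate daily rows to one row per (month, shop, item).
data = df_sales_train[["date","shop_id","item_id","item_cnt_day","item_price"]]
# Counts are summed into a monthly total; prices are averaged, since summing
# daily prices would grow with the number of sale days rather than reflect a price.
data = data.groupby(['date','shop_id',"item_id"]).agg({'item_price':'mean', 'item_cnt_day':'sum'})
data.columns = ['item_price', 'item_cnt_month']
data = data.reset_index()
data.head()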
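
To make the daily-to-monthly aggregation easier to check by eye, here is a minimal sketch on a made-up frame. The `daily` rows are invented, and averaging the price while summing the counts is a choice made for this sketch rather than something the data description prescribes.

```python
import pandas as pd

# Made-up daily rows, already labelled with a year-month string.
daily = pd.DataFrame({
    "date": ["2013-01", "2013-01", "2013-01", "2013-02"],
    "shop_id": [5, 5, 5, 5],
    "item_id": [100, 100, 200, 100],
    "item_price": [10.0, 12.0, 50.0, 11.0],
    "item_cnt_day": [1, 2, 1, 3],
})

# Named aggregation: one output column per (source column, function) pair.
monthly = (
    daily.groupby(["date", "shop_id", "item_id"], as_index=False)
         .agg(item_price=("item_price", "mean"),
              item_cnt_month=("item_cnt_day", "sum"))
)
print(monthly)
```

The two January rows for item 100 collapse into a single row with a count of 3, which is the quantity the forecast targets.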

In [None]:
slope, intercept, r_value, p_value, std_error = stats.linregress(data["shop_id"],data["item_cnt_month"])
print("slope     :{}".format(slope))
print("intercept :{}".format(intercept))
print("r_value   :{}".format(r_value))
print("p_value   :{}".format(p_value))
print("std_error :{}".format(std_error))
print("r_squared :{}".format(r_value**2))

In [None]:
slope, intercept, r_value, p_value, std_error = stats.linregress(data["item_id"],data["item_cnt_month"])
print("slope     :{}".format(slope))
print("intercept :{}".format(intercept))
print("r_value   :{}".format(r_value))
print("p_value   :{}".format(p_value))
print("std_error :{}".format(std_error))
print("r_squared :{}".format(r_value**2))

In [None]:
slope, intercept, r_value, p_value, std_error = stats.linregress(data["item_price"],data["item_cnt_month"])
print("slope     :{}".format(slope))
print("intercept :{}".format(intercept))
print("r_value   :{}".format(r_value))
print("p_value   :{}".format(p_value))
print("std_error :{}".format(std_error))
print("r_squared :{}".format(r_value**2))
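
The three cells above repeat the same univariate regression for different columns. One possible way to collect the results in a single table is sketched below. The frame `toy` is synthetic random data and the helper `simple_fits` is hypothetical, so the numbers it prints say nothing about the real sales data.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the monthly `data` frame (values are random).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "shop_id": rng.integers(0, 60, size=200),
    "item_id": rng.integers(0, 22000, size=200),
    "item_price": rng.uniform(50, 3000, size=200),
})
toy["item_cnt_month"] = rng.poisson(2, size=200)

def simple_fits(frame, features, target):
    """Run one univariate linear regression per feature and tabulate the results."""
    rows = []
    for col in features:
        res = stats.linregress(frame[col], frame[target])
        rows.append({
            "feature": col,
            "slope": res.slope,
            "intercept": res.intercept,
            "r_value": res.rvalue,
            "p_value": res.pvalue,
            "std_error": res.stderr,
            "r_squared": res.rvalue ** 2,
        })
    return pd.DataFrame(rows).set_index("feature")

summary = simple_fits(toy, ["shop_id", "item_id", "item_price"], "item_cnt_month")
print(summary)
```

On the notebook's own frame the call would be `simple_fits(data, ["shop_id", "item_id", "item_price"], "item_cnt_month")`. Note that `shop_id` and `item_id` are identifiers, so a slope fitted against them has no natural interpretation.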

In [None]:
X1 = data[['item_price']]
Y1 = data['item_cnt_month']

# Fit a one-feature linear model and overlay its fitted line on the scatter plot.
regressor = LinearRegression()
regressor.fit(X1, Y1)
regline = regressor.predict(X1)

plt.figure(figsize=(10,8))
plt.scatter(X1['item_price'], Y1, color='crimson')  # keyword must be lowercase 'color'
plt.plot(X1['item_price'], regline, color='red')
plt.grid(alpha=0.7)
plt.xlabel("ITEM_PRICE")
plt.ylabel("ITEM_CNT_MONTH")
plt.show()

In [None]:
X2 = data[['item_id']]
Y2 = data['item_cnt_month']

# Fit a one-feature linear model and overlay its fitted line on the scatter plot.
regressor = LinearRegression()
regressor.fit(X2, Y2)
regline = regressor.predict(X2)

plt.figure(figsize=(10,8))
plt.scatter(X2['item_id'], Y2, color='crimson')  # keyword must be lowercase 'color'
plt.plot(X2['item_id'], regline, color='red')
plt.grid(alpha=0.7)
plt.xlabel("ITEM_ID")
plt.ylabel("ITEM_CNT_MONTH")
plt.show()

In [None]:
X3 = data[['shop_id']]
Y3 = data['item_cnt_month']

# Fit a one-feature linear model and overlay its fitted line on the scatter plot.
regressor = LinearRegression()
regressor.fit(X3, Y3)
regline = regressor.predict(X3)

plt.figure(figsize=(10,8))
plt.scatter(X3['shop_id'], Y3, color='crimson')  # keyword must be lowercase 'color'
plt.plot(X3['shop_id'], regline, color='red')
plt.grid(alpha=0.7)
plt.xlabel("SHOP_ID")
plt.ylabel("ITEM_CNT_MONTH")
plt.show()

In [None]:
X4 = data[[
       'shop_id','item_id'
       ]]
Y4 = data['item_cnt_month']

regressor = LinearRegression()
regressor.fit(X4,Y4)

predict = regressor.predict(df_test[["shop_id","item_id"]])
print(predict)
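
The predictions above still need to be paired with the test `ID` column before they can be submitted. A minimal sketch on made-up values follows. The output column name `item_cnt_month` is assumed to match sample_submission.csv and should be checked against that file. Clipping negative predictions to zero is also an assumption of this sketch, not a rule stated in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for df_test and the model output.
toy_test = pd.DataFrame({"ID": [0, 1, 2], "shop_id": [5, 5, 4], "item_id": [5037, 5320, 5233]})
toy_pred = np.array([0.4, -0.2, 27.5])

submission = pd.DataFrame({
    "ID": toy_test["ID"],
    # Assumption: a monthly sales forecast should not be negative, so clip at 0.
    "item_cnt_month": np.clip(toy_pred, 0, None),
})
print(submission)
# submission.to_csv("submission.csv", index=False)  # uncomment to write the file
```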

In [None]:
data.describe()