# Table of Contents
___

- [Introduction](#Introduction)
- [Data Exploration](#Data-Exploration)
- [Understanding russian using googletrans](#Understanding-russian-using-googletrans)
- [Time series analysis](#Time-series-analysis)
- [Sliding Window Method](#Sliding-Window-Method)
- [Conclusions](#Conclusions)

## Introduction
___

In this kernel, the goal is to make an EDA (exploratory data analysis), focusing on what information we can get after using the `googletrans` library to translate the Russian words in this dataset. The translation is done twice: once with a smaller dataframe and once with a larger one (the only difference is the execution time). For the larger one, there are a few additions that try to reduce the translation time. We'll also see a time series decomposition using the `statsmodels` library. Finally, we will see how to reshape a time series forecasting problem into a supervised learning prediction one using a method called Sliding Window. In particular, we will make use of a linear regression model.

First we import the libraries and get each dataset.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import scipy.stats as stats
import matplotlib.pyplot as plt
%matplotlib inline 
#shows plot from matplotlib and seaborn in the Jupyter notebook

In [None]:
df = pd.read_csv("../input/competitive-data-science-predict-future-sales/sales_train.csv")
items = pd.read_csv("../input/competitive-data-science-predict-future-sales/items.csv")
item_categories = pd.read_csv("../input/competitive-data-science-predict-future-sales/item_categories.csv")
shops = pd.read_csv("../input/competitive-data-science-predict-future-sales/shops.csv")
df.head()

The main dataset doesn't tell us much on its own, but we can merge it with the items and item categories datasets to get more information.

In [None]:
df = pd.merge(df, items, on="item_id")

df = pd.merge(df, item_categories, on="item_category_id")

df.head()

Well, now it's much better! Some information is written in Russian, so we will have to deal with it later. For now, let's see how the sales, which we get from `item_price`, change as time passes.

## Data Exploration
___

There is a catch in this dataset. The dates are written in day-month-year order, rather than the month-day-year order that is very common in English-language datasets. (Thanks @anatoly for calling my attention to that!) We now group by date, sum over `item_price`, and extract the day, month and year from the `date` column.

In [None]:
revenueByDate = pd.DataFrame(df.groupby('date', as_index=False)['item_price'].sum())
revenueByDate["day"] = revenueByDate.date.str.extract("([0-9][0-9]).", expand = False)
revenueByDate["month"] = revenueByDate.date.str.extract(".([0-9][0-9]).", expand = False)
revenueByDate["year"] = revenueByDate.date.str.extract(".([0-9][0-9][0-9][0-9])", expand =False)
revenueByDate.head()
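
As an alternative to the regular expressions, pandas can parse day-first dates directly. This is only a sketch on a toy frame: the dot-separated `%d.%m.%Y` format string is my assumption about how the raw `date` strings look, so adjust it if your file uses another separator.

```python
import pandas as pd

# Toy data; the "dd.mm.yyyy" layout is an assumption, not read from the real file
toy = pd.DataFrame({
    "date": ["02.01.2013", "03.01.2013", "15.02.2013"],
    "item_price": [999.0, 899.0, 1709.05],
})

# An explicit format avoids any day/month ambiguity during parsing
toy["date"] = pd.to_datetime(toy["date"], format="%d.%m.%Y")

# The .dt accessor exposes the calendar parts as integers
toy["year"] = toy["date"].dt.year
toy["month"] = toy["date"].dt.month
toy["day"] = toy["date"].dt.day
print(toy)
```

With an explicit format, a date like 03.01.2013 cannot be silently read as March 1st.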

In [None]:
revenueByDate["date"] = pd.to_datetime(revenueByDate[["year", "month", "day"]])

plt.plot( "date", "item_price", data = revenueByDate.sort_values(by="date"))
plt.xticks(rotation=45)
plt.show()

Not as enlightening as I expected! The data is too noisy in this plot, as it accounts for the variation of each day's sales. It may be good if we make this plot monthly rather than daily, so we'll be able to see a less noisy graph and maybe get some intuition about our series.

But first, we will have to create a new date column so we will have only months and years.

In [None]:
revenueByDate["day"] = 1 #rewrite this column using ones so I can use the DatetimeIndex and aggregate sales by month
revenueByDate['monthlyDate'] = pd.to_datetime(revenueByDate[["year", "month", "day"]])

revenueByDate.head()

In [None]:
revenueByMonth = pd.DataFrame(revenueByDate.groupby("monthlyDate", as_index=False)["item_price"].sum())
revenueByMonth.head()
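
If the `date` column is already a real datetime, the same monthly aggregation can probably be done in one step with `resample`. A minimal sketch on made-up numbers (the frame `daily` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical daily totals, already parsed as datetimes
daily = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-02", "2013-01-20", "2013-02-05"]),
    "item_price": [10.0, 5.0, 7.0],
})

# "MS" labels each bucket with the first day of its month,
# which mirrors the trick of overwriting the day column with 1
monthly = (
    daily.set_index("date")["item_price"]
    .resample("MS")
    .sum()
    .rename_axis("monthlyDate")
    .reset_index()
)
print(monthly)
```

One difference to keep in mind: `resample` also creates rows for months with no data at all (their sum is 0), while the `groupby` version simply skips them.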

Great! Now we have the total sales for each month. What does this series look like?

In [None]:
plt.plot("monthlyDate", "item_price", data = revenueByMonth.sort_values(by="monthlyDate"))
plt.xticks(rotation = 90);

Wow! Now it is much clearer how sales behaved over the whole recorded period. We can see two peaks in sales, one in Dec 2013 and another in Dec 2014. Those peaks can be confirmed using the `nlargest` method, as shown below.

In [None]:
revenueByMonth[revenueByMonth["item_price"] == revenueByMonth["item_price"].max()]

#gets the 2 highest item prices and their respective row index.
revenueByMonth["item_price"].nlargest(2)

revenueByMonth.iloc[[11,23],:]

We've just seen how sales behaved through the years, but what does the revenue look like month by month within each year?

In [None]:
monthlyRev = pd.DataFrame(revenueByDate.groupby(["month", "year"], as_index=False)["item_price"].sum())
monthlyRev.head()


#`height` replaces the `size` argument, which was renamed in seaborn 0.9
g = sns.FacetGrid(data = monthlyRev.sort_values(by="month"), hue = "year", height = 5, legend_out=True)
g = g.map(plt.plot, "month", "item_price")
g.add_legend()
g;

Sales in 2014 were higher than in 2013 in every month, but, as we've seen before, 2015 wasn't a good year for this store: right after January, sales fell below the levels of the two previous years.

Before proceeding in our analysis, we should change the data type of some columns.

In [None]:
df["item_id"] = df["item_id"].astype("category")
df["item_name"] = df["item_name"].astype("category")
df["item_category_id"] = df["item_category_id"].astype("category")
df["item_category_name"] = df["item_category_name"].astype("str")
#apparently, if item_category_name is set as category, we cannot use googletrans in the next section

#a nice way to visualize the content of a dataset and also its shape. It is built to be similar to str function in R.
def rstr(df): 
    return df.shape, df.apply(lambda x: [x.unique()])

rstr(df)

We can create a new dataframe that has the total revenue over the years for each item category. For simplicity, I'll limit our dataframe to the 15 most profitable categories.

In [None]:
sales_by_category = df.groupby("item_category_name", 
                             as_index = False)["item_price"].sum().sort_values(by = "item_price")

top_sales = sales_by_category.nlargest(n=15, columns=["item_price"])

top_sales.set_index(np.arange(0,len(top_sales),1))
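
One caveat worth checking: summing `item_price` adds up the listed price of each row, without weighting by how many units were sold. If `item_cnt_day` is the number of units sold in that row (that is my reading of the column name, so treat it as an assumption), a revenue figure would be price times count. A toy comparison, with made-up categories and numbers:

```python
import pandas as pd

# Made-up rows: category "A" sells the same item on two days
toy = pd.DataFrame({
    "item_category_name": ["A", "A", "B"],
    "item_price": [100.0, 100.0, 30.0],
    "item_cnt_day": [1.0, 3.0, 2.0],
})

# Revenue per row, assuming item_cnt_day counts units sold
toy["revenue"] = toy["item_price"] * toy["item_cnt_day"]

by_price = toy.groupby("item_category_name")["item_price"].sum()
by_revenue = toy.groupby("item_category_name")["revenue"].sum()

print(pd.DataFrame({"sum_of_prices": by_price, "revenue": by_revenue}))
```

The ranking of categories can differ between the two columns, so it may be worth running both before drawing conclusions.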

Oh look! I recognize some words! I know what an Xbox, PS, DVD, PC, Blu-ray and CD are! But... that's all. How can we get information if we don't understand what is being written? Fear not, young grasshopper! I'll guide you with the help of Google (yeah.. always them).

## Understanding russian using googletrans
___

We finally arrived at this section! I cannot understand Russian, although I wish I could, and I think many people here are in the same situation. In order to know what we are dealing with, we should translate those category names. Here I'll use the googletrans library to provide a fair translation of the content. As we all know, Google translations still have a way to go in some languages (translations into Portuguese, my mother language, are terrible sometimes), but hopefully the translation will be good enough to at least provide a basic understanding of the categories.

Here I'll loop through every row in top_sales dataframe and translate the categories while rewriting them. At the end, I'll use a barplot to visualize top_sales dataframe.

In [None]:
#The code below is used to translate each row of "item_category_name".
#However, this library needs internet access (to reach Google)
#in order to make its translation. Kaggle doesn't let kernels access
#the web, so to overcome this issue I've uploaded the top_sales
#dataframe after the translation, the same one you should get after running this
#piece of commented code on your local machine
'''
from googletrans import Translator

translator = Translator()

i = 0
for row in top_sales["item_category_name"]:
    english_word = translator.translate(row)
    top_sales.iloc[i,0] = english_word.text
    i+=1
'''

top_sales = pd.read_csv("../input/additional-data-for-competition/topsales.csv")

sns.barplot(y = "item_category_name", x = "item_price",
             data = top_sales)
plt.title("Sales for each one of the top 15 categories-products")
plt.xlabel("Sales")
plt.ylabel("Category-Product")

It is clear that most sales are related to electronics, especially video games and consoles. Perhaps the 2015 drop in the previous plot was caused by a sharp cut in the purchases of game-related products. We can check that in a moment.

Instead of using the "category - product" format, I'll use only the category, so we can condense information a little bit.

In [None]:
top_sales["item_category"] = top_sales.item_category_name.str.extract(r'([A-Za-z ]+)', expand=False) 

top_sales.head()

It seems we got what we wanted. Now let's make the same plot as before:

In [None]:
sns.barplot(y = "item_category", x = "item_price",
             data = top_sales)
plt.title("Sales for each one of the top 8 categories")
plt.xlabel("Sales")
plt.ylabel("Category")

This plot says what we already figured out from the previous barplot. Most sales are game-related. 

Now we could make a time series plot for each category and see the increase or decrease in sales over the years. First, though, we have to make a similar translation for the whole dataset.

This time, I'm using the counts of items that were sold, rather than the total price, so I won't consider possible changes in prices over time, only the number of items sold.

In [None]:
dailyItensByCat = pd.DataFrame(df.groupby(["item_category_name", "date"], as_index=False)["item_cnt_day"].sum())

dailyItensByCat["month"] = dailyItensByCat.date.str.extract(".([0-9][0-9]).", expand = False)
dailyItensByCat["year"] = dailyItensByCat.date.str.extract(".([0-9][0-9][0-9][0-9])", expand =False)
dailyItensByCat["day"] = 1 #create this column so I can use the DatetimeIndex
dailyItensByCat['monthlyDate'] = pd.to_datetime(dailyItensByCat[["year", "month", "day"]])

monthlyItensByCat = pd.DataFrame(dailyItensByCat.groupby(["monthlyDate", "item_category_name"],
                                                         as_index = False)["item_cnt_day"].sum())
monthlyItensByCat.head()

In [None]:
monthlyItensByCat["item_category"] = (
    monthlyItensByCat.item_category_name.str.extract(r'([А-Яа-яЁё ]+)', expand=False))

monthlyItensByCat.head()

#As a record: I've spent hours looking for a way to extract 
#cyrillic alphabet, but there is no simple answer on the web.
#In the end, turned out the solution was simple indeed and achieved by 
#trial and error. I hope this will help someone when they need a way
#to extract cyrillic alphabet words so they won't spend such a long time
#looking for an answer :P 
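
For anyone who wants to test a Cyrillic pattern before running it on the full dataset, here is a small self-contained check. The three category names are just illustrative strings I typed in, and the character class (both cases of the basic Russian alphabet plus Ё/ё and the space) is one possible choice, not the only one.

```python
import pandas as pd

# Illustrative "category - product" strings (not read from the dataset)
names = pd.Series(["Игры - PS3", "Кино - DVD", "Аксессуары - Xbox 360"])

# Keep the leading run of Cyrillic letters and spaces, then trim the
# trailing space left before the hyphen
general = names.str.extract(r"([А-Яа-яЁё ]+)", expand=False).str.strip()
print(general.tolist())
```

Writing both ranges out explicitly avoids relying on an inline `(?i)` flag, which newer Python versions only accept at the very start of a pattern.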



Again, we keep grouping. Now, we group by the new general category `item_category` and month.

In [None]:
countByCatByMonth = monthlyItensByCat.groupby(["item_category", "monthlyDate"], as_index=False)["item_cnt_day"].sum()
countByCatByMonth['item_category_trans'] = None
countByCatByMonth.head()

Finally we can translate the new dataset. Here I've made a slight change to the code from before. Since googletrans uses the internet to identify the language and translate it (you can set the language in your code, by the way), querying it every time for the same word is very time consuming. So I've created a dictionary to store the words that were already translated and reuse them rather than querying Google Translate again.

In [None]:
#Here again, we need internet access to perform the translations
#and Kaggle doesn't let us. So I uploaded the exact dataframe that
#should be generated after running the code below

'''
from googletrans import Translator

cat_translator = Translator()

obs = 0
already_translated = {'word': [], 'translation':[]}
#creates a new column in countByCatByMonth
countByCatByMonth['item_category_trans'] = None

#for each row...
for cat_name in countByCatByMonth["item_category"]:
    
    #check if it has already been translated
    if cat_name in already_translated['word']:
        #if it is, get the index of the original word on the dictionary...
        word_index = already_translated["word"].index(cat_name)
        #and use it to get the translated word
        countByCatByMonth.iloc[obs,3] = already_translated["translation"][word_index]
    #if the word was not translated yet...
    else:
        try:
            #translate it
            english_word = cat_translator.translate(cat_name)
            #append in dictionary for later use
            already_translated['word'].append(cat_name)
            already_translated['translation'].append(english_word.text)
            #write the translated word into dataframe
            countByCatByMonth.iloc[obs,3] = english_word.text
    
        except:
            print ("Error in row "+ cat_name)
    obs+=1
'''
countByCatByMonth = pd.read_csv("../input/additional-data-for-competition/countbycatbymonth.csv")
countByCatByMonth.head()
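
The caching idea can also be written with a plain word-to-translation dictionary, which makes lookups simpler than keeping two parallel lists. The sketch below runs offline: `fake_translate` is a made-up stand-in for the real translator call, and `translate_unique` is a hypothetical helper name.

```python
def translate_unique(words, translate_fn):
    """Translate each word, calling translate_fn only once per distinct word."""
    cache = {}
    translated = []
    for word in words:
        if word not in cache:
            # First time we see this word: pay the cost of a translation
            cache[word] = translate_fn(word)
        translated.append(cache[word])
    return translated, cache


# Stand-in for the online translator so the example runs without internet
calls = []

def fake_translate(word):
    calls.append(word)
    return {"Игры": "Games", "Кино": "Movies"}.get(word, word)


translated, cache = translate_unique(["Игры", "Кино", "Игры", "Игры"], fake_translate)
print(translated)
print(len(calls), "translator calls for", len(translated), "rows")
```

With a dataframe, the resulting `cache` could then be applied in one go with something like `countByCatByMonth["item_category"].map(cache)`.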

In [None]:
#`height` replaces the `size` argument, which was renamed in seaborn 0.9
g = sns.FacetGrid(data = countByCatByMonth.sort_values(by="monthlyDate"), hue = "item_category_trans", legend_out=True, height = 8)
g = (g.map(plt.plot, "monthlyDate", "item_cnt_day").set(xticks=[0, 5, 11, 17, 23, 29, 35],
                                                        xticklabels=['2013-01-01', '2013-06-01', '2013-12-01', 
                                                                     '2014-06-01','2014-12-01', '2015-06-01','2015-12-01']))
g.add_legend()
plt.xlabel("Monthly Date")
plt.ylabel("Number of items sold");


What a good-looking plot! From it we can see that items from the most profitable categories had a drop in sales. People have been buying fewer of these products since the beginning of 2014, although some items had a sharper decrease than others.

## Time series analysis
___

First I'll analyse the item_price series. For this analysis, it is best to set the dates as the index, so that they show up on the x-axis of the decomposition plots.

In [None]:
ts = revenueByMonth.set_index("monthlyDate")
ts.head()

First let's take a look at the decomposition of the item_price time series. We will split our time series plot in three components: trend, seasonality and residual error (the part of the time series that cannot be explained by the former two). After that, we plot each component and analyse what we got.

In [None]:
from statsmodels.tsa.seasonal import seasonal_decompose
decomposition = seasonal_decompose(ts)

trend = decomposition.trend
seasonal = decomposition.seasonal
residual = decomposition.resid

f, ((ax1, ax2), (ax3, ax4)) = plt.subplots(2, 2, sharex='col', sharey = 'row')
ax1.plot(ts)
ax1.set_title("original")
ax2.plot(trend)
ax2.set_title("trend")
ax3.plot(seasonal)
ax3.set_title("seasonal")

ax4.plot(residual)
ax4.set_title("residual")
f.set_figheight(7)
f.set_figwidth(10)

#let me rotate the labels of x-axis. 
#Otherwise they get mixed and very hard to read.
for ax in f.axes:
    plt.sca(ax)
    plt.xticks(rotation = 45)

In this decomposition, we can describe the behaviour of the most important features of our time series. We can clearly see that sales increase every December. When we decompose a time series, the algorithm will always try to find a seasonality. So, to confirm whether there is a real seasonality, we should look at both the residual and seasonal plots and compare their scales. Sometimes the magnitude of the residuals is way above the seasonal magnitude, which means that if there is a seasonal effect, it is negligible. Here, however, seasonality does indeed exist: the order of magnitude is the same (1e7) and the maximum seasonal effect on sales is about 4 times the highest residual.

The trend goes up until July 2014, then starts to drop and keeps dropping until the end of the records.

The residual plot shows two key points in time where neither trend nor seasonal effects can explain the sales: Dec 2013 and Dec 2014. In the former, sales were lower than expected from the trend at the time and the seasonal effect, while in the latter, sales were higher than expected.
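
To make the "compare the scales" check concrete, here is a hand-rolled additive decomposition on a synthetic monthly series with a December bump. It is a simplified sketch of the classical approach (a 2x12 centered moving average for the trend, then per-month averages for the seasonal part), not a reproduction of what `statsmodels` does internally, and all numbers are simulated.

```python
import numpy as np
import pandas as pd

# Synthetic series: flat level of 100, +50 every December, small noise
idx = pd.date_range("2013-01-01", periods=36, freq="MS")
rng = np.random.default_rng(0)
december_bump = np.where(idx.month == 12, 50.0, 0.0)
y = pd.Series(100.0 + december_bump + rng.normal(0, 2, size=36), index=idx)

# Trend: 12-month mean, averaged over 2 neighbours, re-centred on its window
trend = y.rolling(12).mean().rolling(2).mean().shift(-6)

# Seasonal: average detrended value per calendar month, centred on zero
detrended = y - trend
month_means = detrended.groupby(detrended.index.month).mean()
month_means = month_means - month_means.mean()
seasonal = pd.Series(month_means.reindex(idx.month).to_numpy(), index=idx)

# Residual: whatever trend and seasonality leave unexplained
resid = y - trend - seasonal

seasonal_range = seasonal.max() - seasonal.min()
resid_range = resid.max() - resid.min()
print(f"seasonal range: {seasonal_range:.1f}, residual range: {resid_range:.1f}")
```

When the seasonal range clearly dominates the residual range, as it does by construction here, the seasonality is probably real rather than an artifact of the decomposition.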

Now that we've seen how the item_price time series behaves, let's make some predictions. Here I'll use a relatively unorthodox method. Rather than ARIMA or ETS, we'll be using the...

## Sliding Window Method

This method can be used to transform what would be a time series forecast into a supervised learning prediction. The key idea here is to use the value of `item_price` on the previous date as the predictor variable for the current date. For simplicity, I'll call the previous `item_price` `item_price_x`.

In [None]:
revenueByMonth["item_price_x"] = revenueByMonth["item_price"].shift(1)

slideRevenueByMonth = revenueByMonth.drop(columns = "monthlyDate")

sns.lmplot(data = slideRevenueByMonth, x = "item_price_x", y = "item_price", ci=None);

slideRevenueByMonth.head()

Note that the first row of `item_price_x` is a NaN value, because there is no previous `item_price` for the first row.
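
The same idea extends to a window of more than one previous value. A minimal sketch, where `make_lagged_frame` and the column names `y` and `y_lagN` are hypothetical names I made up for the example:

```python
import pandas as pd

def make_lagged_frame(series, n_lags=1):
    """Build a supervised-learning table: target y plus its n_lags previous values."""
    frame = pd.DataFrame({"y": series})
    for lag in range(1, n_lags + 1):
        # shift(lag) moves values down, so each row sees the value from `lag` steps back
        frame[f"y_lag{lag}"] = series.shift(lag)
    # The first n_lags rows have no complete history, so they are dropped
    return frame.dropna()

toy_series = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0])
lagged = make_lagged_frame(toy_series, n_lags=2)
print(lagged)
```

Each extra lag costs one more row at the start of the series, which matters a lot when there are only a few dozen months to begin with.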

I'll also take the log of both columns so the data will be less spread out.

In [None]:
slideRevenueByMonth["log_itemprice"] = np.log(slideRevenueByMonth["item_price"])

slideRevenueByMonth["log_itemprice_x"] = np.log(slideRevenueByMonth["item_price_x"])

In [None]:
sns.lmplot(data = slideRevenueByMonth, x = "log_itemprice_x", y = "log_itemprice", ci=None);

slideRevenueByMonth.head()

The data now looks easier to work with. I'll use the log values to make the predictions here.

The sliding window method is now done and we can proceed to the actual training. Let's create the training and test datasets first and then use Random Forest and Linear regressors.

### First trials

In [None]:
xtrain = slideRevenueByMonth.iloc[1:20,3].values.reshape(-1,1)
ytrain = slideRevenueByMonth.iloc[1:20,2]

xtest = slideRevenueByMonth.iloc[20:,3].values.reshape(-1,1)
ytest = slideRevenueByMonth.iloc[20:, 2]

In [None]:
#Random Forest Regressor
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(random_state = 42)

rf_reg.fit(xtrain, ytrain)

ypred_rf = rf_reg.predict(xtest)

#LinearRegressor

from sklearn.linear_model import LinearRegression

linear = LinearRegression()

linear.fit(xtrain,ytrain)

ypred_linear = linear.predict(xtest)

Now that the predictions are done, we will see how the models performed against the testing set. The ideal scatterplot here would be a 45 degree line, meaning that the predictions match the actual values perfectly.
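
A single error number can complement the scatterplots. The sketch below computes the RMSE on the log scale and compares it with a "persistence" baseline that just predicts the previous value. All arrays are made-up numbers, not results from this notebook.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Made-up log values, only to show the mechanics
y_true_log = np.array([18.2, 18.0, 18.5])    # actual log sales
y_prev_log = np.array([18.1, 18.2, 18.0])    # previous month's log sales (baseline forecast)
y_model_log = np.array([18.15, 18.1, 18.3])  # hypothetical model output

baseline_rmse = rmse(y_true_log, y_prev_log)
model_rmse = rmse(y_true_log, y_model_log)
print(f"baseline: {baseline_rmse:.3f}, model: {model_rmse:.3f}")
```

In this notebook's terms, the persistence forecast would simply be `xtest` itself, since that column already holds the previous month's log value. A model that cannot beat it is not adding much.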

In [None]:
#Just putting all predictions in one dataframe to get things more organized
testing = pd.concat([pd.DataFrame(ypred_linear),
                     pd.DataFrame(ypred_rf),pd.DataFrame(ytest).reset_index().drop(["index"], axis = 1)], axis = 1)

testing.columns = ["log_itemprice_pred_ln","log_itemprice_pred_rf", "log_itemprice"]

#making the plot and also drawing a 45 degree line so we can see the ideal situation
grid = sns.JointGrid(y = testing.log_itemprice_pred_rf, x = testing.log_itemprice, space=0, height=6, ratio=50)
grid.plot_joint(plt.scatter, color="g")
plt.plot([17.8, 19.0], [17.8, 19.0], linewidth=2);

Well... could be worse, right? Let's see how the linear model performed.

In [None]:
grid = sns.JointGrid(y = testing.log_itemprice_pred_ln, x = testing.log_itemprice, space=0, height=6, ratio=50)
grid.plot_joint(plt.scatter, color="g")
plt.plot([17.8, 19.0], [17.8, 19.0], linewidth=2);

Doesn't look good either... In both models, low values of the log item price (< 18.1) are predicted to be higher than they should be, and high values are predicted to be lower. I thought about this for a while, and maybe we could improve those results by creating a new variable. And that leads us to...

### The UpOrDown variable

We know that if there is a rising short-term trend, it is more likely that our next value will go up. Similarly, if we are looking at a descending trend, it is likely the next value will be lower. So I will take the same approach as before, but this time we will create a column called `UpOrDown` that is 1 if the present value is higher than the previous one and -1 if it is lower (0 if they are equal). Since at prediction time we won't know the current value needed to compute `UpOrDown`, we will use `PreviousUpOrDown` as the predictor instead. That is, were sales increasing or decreasing the last time we saw them?

This could solve the issue where, for example, our model sees the value "18" and, based on its training, thinks the next value will be 19, because during training the local trend was a rising one while now it is a descending one.

In [None]:
slideRevenueByMonth["UpOrDown"] = np.sign(slideRevenueByMonth["item_price"] - slideRevenueByMonth["item_price_x"])
slideRevenueByMonth["PreviousUpOrDown"] = slideRevenueByMonth["UpOrDown"].shift(1)
#drop two rows that have null values in UpOrDown and PreviousUpOrDown
slideRevenueByMonth = slideRevenueByMonth.dropna()
slideRevenueByMonth.head()
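
To see exactly which rows survive and why using the shifted flag avoids peeking at the current value, here is the same construction on five made-up numbers:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"item_price": [10.0, 12.0, 11.0, 15.0, 14.0]})

# Previous value, as in the sliding window step
toy["item_price_x"] = toy["item_price"].shift(1)

# +1 when the current value rose, -1 when it fell; this uses the current value,
# so it cannot be a predictor for the current row
toy["UpOrDown"] = np.sign(toy["item_price"] - toy["item_price_x"])

# Shifting once more gives a flag that only depends on past values
toy["PreviousUpOrDown"] = toy["UpOrDown"].shift(1)

usable = toy.dropna()
print(usable)
```

Two of the five rows are lost: the first has no previous value, and the second has no previous direction.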

In [None]:
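#note: the NaN rows were already dropped above, so starting at position 2 skips two more valid rows
xtrain = slideRevenueByMonth.iloc[2:22,[3,5]]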
ytrain = slideRevenueByMonth.iloc[2:22,2]

xtest = slideRevenueByMonth.iloc[22:,[3,5]]
ytest = slideRevenueByMonth.iloc[22:, 2]

##########Random Forest###########
from sklearn.ensemble import RandomForestRegressor

rf_reg = RandomForestRegressor(random_state = 42)

rf_reg.fit(xtrain, ytrain)

ypred_rf = rf_reg.predict(xtest)

##########LinearRegressor###########

from sklearn.linear_model import LinearRegression

linear = LinearRegression()

linear.fit(xtrain,ytrain)

ypred_linear = linear.predict(xtest)

In [None]:
#Creating the predictions dataframe
testing = pd.concat([pd.DataFrame(ypred_linear),
                     pd.DataFrame(ypred_rf),pd.DataFrame(ytest).reset_index().drop(["index"], axis = 1)], axis = 1)

testing.columns = ["log_itemprice_pred_ln","log_itemprice_pred_rf", "log_itemprice"]

grid = sns.JointGrid(y = testing.log_itemprice_pred_rf,x = testing.log_itemprice, space=0, height=6, ratio=50)
grid.plot_joint(plt.scatter, color="g")
plt.plot([17.8, 19.0], [17.8, 19.0], linewidth=2)
plt.title("Random Forest x True", fontsize = 20);

In [None]:
grid = sns.JointGrid(y = testing.log_itemprice_pred_ln,x = testing.log_itemprice, space=0, height=6, ratio=50)
grid.plot_joint(plt.scatter, color="g")
plt.plot([17.8, 19.0], [17.8, 19.0], linewidth=2)
plt.title("Linear Regression x True", fontsize = 20);

Random Forest didn't improve much, but the Linear Regressor did a pretty amazing job here. It is clear that the results are biased, but now they are linear. If we manage to identify that bias, our predictions could be more accurate, which would allow us to use Linear Regression to make forecasts. I believe that if we had access to more training data the model would be better, as we only trained the algorithm on 20 data points (`iloc[2:22]`) and tested on the rest. The fact that this method forces us to drop two data points (one due to UpOrDown and another due to PreviousUpOrDown) makes it even more important to have a greater number of training points for this technique to work at its best.

## Conclusions
___

In this notebook, we've seen how to deal with data in another language using the `googletrans` package. Some exploratory analysis was done, and we saw that some categories sold more than others and that by mid-2014 sales had started dropping. We've made a basic decomposition of the `item_price` time series and confirmed that there is indeed a seasonality (sales increase a lot in December). At the end, we saw how to use the sliding window method to make forecasts using supervised machine learning techniques, especially Linear Regression. Further studies should be done on how to improve this method and use it to predict the number of items sold for all items in the dataset.

I'm ending the notebook here, but as soon as I learn more about time series, I'll come back and finish it with a forecast. I'm planning to use an ARIMA model to make it. If you liked it, please upvote! If you have any suggestions to improve the notebook, or any questions, please tell me in the comments!