<font size=6>Predicting Future Sales</font>  
<font size=5>Part 1 : Data Preparation - Translating from Russian</font>


---

**Environment check**

In [None]:
import sys
IN_COLAB = "google.colab" in sys.modules
# PATH_DRIVE: adjust to match your own Google Drive folder layout
PATH_DRIVE = "/content/drive/My Drive/MachineLearning/ML08"

In [None]:
if IN_COLAB:
    print("The notebook is running on Google Colab")
else:
    print("The notebook is running locally")

The notebook is running on Google Colab


In [None]:
if IN_COLAB:
    from google.colab import drive, files
    drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


---
## <font color=blue>Notebook set-up</font>

In [None]:
import pandas as pd
import numpy as np
import os
import re
import random as python_random

In [None]:
if IN_COLAB:
    sys.path.append(PATH_DRIVE)
    os.chdir(PATH_DRIVE)

In [None]:
RANDOM_SEED = 42
BATCH_SIZE = 32

In [None]:
def reset_random_seeds():
    np.random.seed(RANDOM_SEED)
    python_random.seed(RANDOM_SEED)

In [None]:
reset_random_seeds()

---
## <font color=blue>1. Translation : from Russian to English</font>

It is hard to explore a dataset without understanding its content.
In this competition, the names of products, categories and shops are in Russian, so I translate all of them into English. Some hidden information may become visible after this step.

### 1.1. Items

In [None]:
items = pd.read_csv("data/items.csv")
print("Number of rows : {:,.0f}".format(len(items)))
items.head()

Number of rows : 22,170


Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


I clean item_name with a regex:
- replace punctuation, except () and [], with a space : re.sub(r"[^\w\s\(\)\[\]]+", " ", ...
- collapse multiple spaces into one : re.sub(r"\s{2,}", " ", ...
- removal of leading and trailing spaces : .strip()  
  
This is helpful for the translation of item_names.

In [None]:
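# Raw strings (r"...") keep the regex backslashes from being read as string escapes.
# Inner sub: punctuation -> space; outer sub: collapse repeated whitespace.
items["item_name"] = \
    [re.sub(r"\s{2,}", " ", re.sub(r"[^\w\s\(\)\[\]]+", " ", i)).strip()
     for i in items.item_name]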
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ ) D,0,40
1,ABBYY FineReader 12 Professional Edition Full ...,1,76
2,В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,КОРОБКА (СТЕКЛО) D,4,40
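
The cleaning rule can also be checked on single strings before it is applied to the whole column. The sketch below wraps the same two substitutions in a helper; the name `clean_name` and the `keep_hyphen` switch are hypothetical and not part of the original notebook.

```python
import re


def clean_name(name, keep_hyphen=False):
    """Replace punctuation with spaces, collapse whitespace, trim both ends.

    keep_hyphen is a hypothetical switch: when True, "-" is also preserved.
    """
    allowed = r"\w\s\(\)\[\]" + (r"\-" if keep_hyphen else "")
    no_punct = re.sub("[^" + allowed + "]+", " ", name)
    return re.sub(r"\s{2,}", " ", no_punct).strip()


# Two of the raw item names shown in the first table
print(clean_name("! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D"))
print(clean_name("***КОРОБКА (СТЕКЛО) D"))
```

The printed strings should match rows 0 and 4 of the cleaned table above.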


I save the item names in a text file and submit it to a translation tool such as Google Translate or DeepL. I find DeepL better for this task, but handling this much data requires a subscription to their professional version (one-month free trial). I split the data into two text files to keep each file under 1 MB. I have noticed that keeping the index in front of the item name can cause translation errors, so I only store the item name.
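
Splitting in half is enough for this dataset. If a file still ended up over the limit, a size-based split would be one alternative. The sketch below is only an illustration: the helper `split_by_size` and its `max_bytes` parameter are hypothetical, and it assumes the files are written as UTF-8 with one name per line.

```python
def split_by_size(names, max_bytes=1_000_000):
    """Group names into consecutive chunks whose UTF-8 size stays under max_bytes.

    A single name longer than max_bytes still gets a chunk of its own.
    """
    chunks, current, current_size = [], [], 0
    for name in names:
        size = len(name.encode("utf-8")) + 1  # +1 for the newline
        if current and current_size + size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


sample = ["ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ ) D",
          "В ЛУЧАХ СЛАВЫ (UNV) D",
          "КОРОБКА (СТЕКЛО) D"]
# A tiny limit, only to make the split visible on three names
chunks = split_by_size(sample, max_bytes=100)
print([len(chunk) for chunk in chunks])
```

Cyrillic characters take two bytes each in UTF-8, which is why the size is measured on the encoded string rather than with len(name).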


In [None]:
mid_items = len(items) // 2
items_1 = items.iloc[:mid_items][["item_name"]].copy()
items_1.to_csv("data/items_name_1_noindex.txt", index=False, header=False)
items_2 = items.iloc[mid_items:][["item_name"]].copy()
items_2.to_csv("data/items_name_2_noindex.txt", index=False, header=False)

I load the translated files and replace the Russian names with the English ones.

In [None]:
english1 = pd.read_csv("data/items_name_1_noindex_translated.txt", header=None)
english2 = pd.read_csv("data/items_name_2_noindex_translated.txt", header=None)
item_name = pd.concat([english1, english2], ignore_index=True)
item_name.index = items.index
items["item_name"] = item_name[0].values
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,IN THE POWER OF OBSESSION (LAYER ) D,0,40
1,ABBYY FineReader 12 Professional Edition Full ...,1,76
2,IN THE GLORY (UNV) D,2,40
3,Blue Wave (Univ) D,3,40
4,BOX (GLASS) D,4,40
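
Because the names are matched back by position, the replacement is only correct if the translated files contain exactly one line per original name. A quick check along these lines can catch a dropped or merged line before the column is overwritten. This is a sketch: `check_alignment` is a hypothetical helper, shown here on plain lists rather than on the notebook's DataFrames.

```python
def check_alignment(original_names, translated_names):
    """Raise ValueError unless there is one non-empty translation per original name."""
    if len(original_names) != len(translated_names):
        raise ValueError(
            "Expected {} translated names, got {}".format(
                len(original_names), len(translated_names)))
    empty = [i for i, name in enumerate(translated_names) if not name.strip()]
    if empty:
        raise ValueError("Empty translations at rows: {}".format(empty[:10]))
    return True


original = ["В ЛУЧАХ СЛАВЫ (UNV) D", "ГОЛУБАЯ ВОЛНА (Univ) D", "КОРОБКА (СТЕКЛО) D"]
translated = ["IN THE GLORY (UNV) D", "Blue Wave (Univ) D", "BOX (GLASS) D"]
print(check_alignment(original, translated))
```

In the notebook, the same check could be run on the Russian item names (kept aside before they are overwritten) and on the concatenated translated column.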


In [None]:
items.to_csv("data/items_english.csv", index=False)

### 1.2. Categories

In [None]:
categories = pd.read_csv("data/item_categories.csv")
print("Number of rows : {:,.0f}".format(len(categories)))
categories.head()

Number of rows : 84


Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


I remove unnecessary punctuation and spaces (as for items), but I keep hyphens.

In [None]:
# Same cleaning as for items, with "-" added to the characters that are kept.
categories["item_category_name"] = \
    [re.sub(r"\s{2,}", " ", re.sub(r"[^\w\s\(\)\[\]\-]+", " ", c)).strip()
     for c in categories.item_category_name]
categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [None]:
categories_english = categories[["item_category_name"]].copy()
categories_english.to_csv("data/categories_english.txt", header=False, index=False)

I translate the names with DeepL and replace the Russian names with the English ones.

In [None]:
categories_name = pd.read_csv("data/categories_english_translated.txt", header=None)
categories_name.columns = ["item_category_name"]
categories_name.index = categories.index
categories["item_category_name"] = categories_name.item_category_name
categories.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Headphone headsets,0
1,Accessories - PS2,1
2,Accessories - PS3,2
3,Accessories - PS4,3
4,Accessories - PSP,4
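
The translated rows above hint at the kind of hidden information mentioned earlier: each name looks like a main group followed by a subgroup, separated by " - ". The sketch below splits on that separator. It assumes the pattern seen in these five rows; the remaining categories have not been checked here, and `split_category` is a hypothetical helper.

```python
def split_category(name):
    """Split 'Main - Sub' into (main, sub); sub is '' when there is no ' - '."""
    main, _, sub = name.partition(" - ")
    return main.strip(), sub.strip()


for name in ["PC - Headphone headsets", "Accessories - PS2", "Accessories - PS3"]:
    print(split_category(name))
```

Keeping the hyphen during cleaning is what makes this split possible.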


In [None]:
categories.to_csv("data/item_categories_english.csv", index=False)

### 1.3. Shops

In [None]:
shops = pd.read_csv("data/shops.csv")
print("Number of rows : {:,.0f}".format(len(shops)))
shops.head()

Number of rows : 60


Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


In [None]:
# Same cleaning as for categories: punctuation -> space, hyphens kept.
shops["shop_name"] = \
    [re.sub(r"\s{2,}", " ", re.sub(r"[^\w\s\(\)\[\]\-]+", " ", s)).strip()
     for s in shops.shop_name]
shops.head()

Unnamed: 0,shop_name,shop_id
0,Якутск Орджоникидзе 56 фран,0
1,Якутск ТЦ Центральный фран,1
2,Адыгея ТЦ Мега,2
3,Балашиха ТРК Октябрь-Киномир,3
4,Волжский ТЦ Волга Молл,4


In [None]:
shops_english = shops[["shop_name"]].copy()
shops_english.to_csv("data/shops_english.txt", header=False, index=False)

I translate the names with DeepL and replace the Russian names with the English ones.

In [None]:
shops_name = pd.read_csv("data/shops_english_translated.txt", header=None)
shops_name.columns = ["shop_name"]
shops_name.index = shops.index
shops["shop_name"] = shops_name.shop_name
shops.head()

Unnamed: 0,shop_name,shop_id
0,Yakutsk Ordzhonikidze 56 francs,0
1,Yakutsk shopping center Central franc,1
2,Adygeya shopping center Mega,2
3,Balashikha shopping mall Oktyabr-Kinomir,3
4,Volga shopping center Volga Mall,4


In [None]:
shops.to_csv("data/shops_english.csv", index=False)