<center><img src="img/logo_hse_black.jpg"></center>

<h1><center>Data Analysis</center></h1>
<h2><center>Seminar: Introduction to Natural Language Processing<sup><a href="#fn1" id="ref1">1</a></sup></center></h2>

### Sentiment Analysis in Russian (from https://www.kaggle.com/c/sentiment-analysis-in-russian/data)

#### Load data

In [1]:
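# tqdm_notebook is deprecated in recent tqdm releases; keep the old name as an alias
from tqdm.notebook import tqdm as tqdm_notebook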

In [2]:
import json

# the texts are in Russian, so read the file explicitly as UTF-8
with open('Data/train.json', encoding='utf-8') as json_file:
    data = json.load(json_file)

#### Tokenize and clean data

In [3]:
import nltk
nltk.download('stopwords')
import string
word_tokenizer = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('russian')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/r.britkov/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [4]:
def process_data(data):
    texts = []
    targets = []

    for item in data:
        if item['sentiment'] == 'negative':
            targets.append(0)
        else:
            targets.append(1)
        
        tokens = word_tokenizer.tokenize(item['text'].lower())
        
        # drop punctuation and stop words
        tokens = [word for word in tokens if (word not in string.punctuation and word not in stop_words)]
        
        texts.append(tokens)
    
    return texts, targets

In [5]:
texts, y = process_data(data)
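
One caveat about the filter in `process_data`: `word not in string.punctuation` is a substring test against one long string. A multi-character token such as `...` is therefore removed only if it happens to occur contiguously inside `string.punctuation`. The sketch below illustrates this and shows one possible stricter check. The helper `is_punct` is hypothetical and is not part of the original notebook.

```python
import string

def is_punct(token):
    """Return True if every character of the token is ASCII punctuation."""
    return len(token) > 0 and all(ch in string.punctuation for ch in token)

samples = [",", "...", "!!", "()", "?!"]

# What the notebook's check does: substring lookup in one long string.
substring_check = {tok: tok in string.punctuation for tok in samples}

# A character-by-character alternative (hypothetical helper).
charwise_check = {tok: is_punct(tok) for tok in samples}

for tok in samples:
    print(repr(tok), substring_check[tok], charwise_check[tok])
```

Swapping the check would change which tokens survive, so the metrics reported later in the notebook would no longer match exactly.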

#### Normalize words

#### 1) Stemming

In [6]:
from nltk.stem.snowball import SnowballStemmer 

stemmer = SnowballStemmer("russian")

In [7]:
for i in tqdm_notebook(range(len(texts))):
    texts[i] = ' '.join(list(map(stemmer.stem, texts[i])))





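Advantages: fast.

Disadvantages: the normalization is crude. A stemmer only strips affixes by rule, so unrelated words can collapse to the same stem and forms of one word can end up with different stems.

Alternative: lemmatization (more accurate, but slower than stemming).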

In [8]:
# train/test split, stratified by the target
from sklearn.model_selection import train_test_split
train_texts, test_texts, train_y, test_y = train_test_split(texts, y, test_size=0.33, random_state=42, stratify = y)
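
To illustrate what `stratify=y` is for, here is a minimal pure-Python sketch of a stratified split: each class is shuffled and split separately, so both parts keep roughly the same class proportions. This is not scikit-learn's implementation. The function name `stratified_split` and the toy labels are assumptions made for the example.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.33, seed=42):
    """Return (train_idx, test_idx) with similar class shares in both parts."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

# Toy, imbalanced labels: 20 negatives and 80 positives.
labels = [0] * 20 + [1] * 80
train_idx, test_idx = stratified_split(labels, test_size=0.25)

train_share = sum(labels[i] for i in train_idx) / len(train_idx)
test_share = sum(labels[i] for i in test_idx) / len(test_idx)
print(train_share, test_share)
```

Without stratification, a random split of an imbalanced dataset can leave the test part with a noticeably different share of positives, which makes the metrics harder to compare.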

### Recap TF-IDF

TF-IDF is a method for measuring how important a word is to a text within a collection of texts.

$$
\text{TF-IDF}(t, d) = tf(t, d) \cdot idf(t, D)
$$

where $tf(t, d)$ is, in the simplest case, the raw count of term $t$ in document $d$, and
$$
idf(t, D) = \log \frac{N}{N(t)}
$$
where $N$ is the total number of documents in the collection $D$ and $N(t)$ is the number of documents that contain term $t$.
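
The formula can be checked by hand on a tiny corpus. The sketch below implements it literally in plain Python. The three toy documents and the function names are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: each document is already tokenized.
docs = [
    ["good", "film", "good", "plot"],
    ["bad", "film"],
    ["good", "actor"],
]

def tf(term, doc):
    """Raw count of the term in one document."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(N / N(t)); undefined if the term occurs in no document."""
    n_docs_with_term = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_docs_with_term)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_good = tf_idf("good", docs[0], docs)  # 2 * log(3 / 2)
score_film = tf_idf("film", docs[0], docs)  # 1 * log(3 / 2)
score_plot = tf_idf("plot", docs[0], docs)  # 1 * log(3 / 1)
print(score_good, score_film, score_plot)
```

In the first document, "plot" gets the highest score because it appears in no other document, while "film" gets the lowest because it is both rare in the document and common in the corpus.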

In [9]:
# compute TF-IDF features, keeping at most 40000 vocabulary terms
from sklearn.feature_extraction.text import TfidfVectorizer, TfidfTransformer
vectorizer = TfidfVectorizer(max_features = 40000)

In [10]:
X = vectorizer.fit_transform(train_texts)

`X` is a matrix of shape (number of texts $\times$ number of words in the vocabulary). Each entry is the TF-IDF value of the corresponding word in the corresponding text. If each entry held the raw count of the word in the text instead, the representation would be called "bag-of-words".
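
Note that scikit-learn's `TfidfVectorizer` does not apply the textbook formula exactly. With its default settings (`smooth_idf=True`, `norm='l2'`) it uses a smoothed idf, $\ln\frac{1 + N}{1 + N(t)} + 1$, and then scales every row to unit Euclidean length. The sketch below compares the two idf variants in plain Python. The function names are hypothetical, and the value of `N` is the number of training texts from this notebook.

```python
import math

def idf_textbook(n_docs, doc_freq):
    """idf(t, D) = log(N / N(t)), as in the recap above."""
    return math.log(n_docs / doc_freq)

def idf_smoothed(n_docs, doc_freq):
    """Smoothed variant used by TfidfVectorizer when smooth_idf=True."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5536  # number of training texts in this notebook

for doc_freq in (1, 100, n_docs):
    print(doc_freq, idf_textbook(n_docs, doc_freq), idf_smoothed(n_docs, doc_freq))
```

The practical difference is that a word occurring in every text gets weight 0 under the textbook formula but weight 1 under the smoothed one, so it is not discarded entirely.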

In [11]:
X.shape, len(texts)

((5536, 40000), 8263)

In [12]:
import numpy as np

In [13]:
X = X.toarray()

Now let's fit a logistic regression model. 40,000 features is too many for the model, so we first use PCA to reduce the dimensionality.
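
A rough estimate shows why this step is heavy: `toarray()` turns the sparse TF-IDF matrix into a dense one, and its size follows directly from the shape printed earlier. The sketch assumes 8-byte `float64` values, which is the default dtype of `TfidfVectorizer`.

```python
# Shape of the training matrix printed above.
n_rows, n_cols = 5536, 40000
bytes_per_value = 8  # float64 (assumed default dtype)

dense_bytes = n_rows * n_cols * bytes_per_value
dense_gib = dense_bytes / 2**30

print(f"{dense_bytes} bytes, about {dense_gib:.2f} GiB")
```

Standardizing and running PCA on a dense matrix of this size accounts for the multi-minute run time reported in the next cells. If memory is tight, scikit-learn's `TruncatedSVD` is a commonly used alternative that accepts sparse input directly, although it is not the same transformation as scaling followed by PCA.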

In [14]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
from sklearn.decomposition import PCA
pca = PCA(n_components = 1000)

In [15]:
%%time
X_small = pca.fit_transform(scaler.fit_transform(X))

CPU times: user 2min 41s, sys: 24.4 s, total: 3min 6s
Wall time: 2min 20s


In [16]:
X_small.shape

(5536, 1000)

In [17]:
from sklearn.linear_model import LogisticRegression

In [18]:
model = LogisticRegression()
model.fit(X_small, train_y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [19]:
# predictions on the training set (an optimistic estimate of quality)
predict = model.predict(X_small)
proba = model.predict_proba(X_small)

In [20]:
from sklearn.metrics import accuracy_score, roc_auc_score
print("ACCURACY = {}".format(accuracy_score(train_y, predict)))
print("ROC-AUC = {}".format(roc_auc_score(train_y, proba[:, 1])))

ACCURACY = 0.948157514450867
ROC-AUC = 0.9824134695757494
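
The two metrics answer different questions: accuracy scores the hard 0/1 predictions, while ROC-AUC scores how well the predicted probabilities rank positives above negatives. A minimal pure-Python sketch of both follows. It uses the pairwise definition of ROC-AUC, which is quadratic in the number of samples and meant only for illustration. The toy labels and scores are made up.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true = [0, 0, 1, 1, 1]
scores = [0.1, 0.6, 0.4, 0.7, 0.9]
y_pred = [int(s >= 0.5) for s in scores]  # threshold at 0.5

toy_accuracy = accuracy(y_true, y_pred)
toy_auc = roc_auc(y_true, scores)
print(toy_accuracy, toy_auc)
```

Because ROC-AUC does not depend on the 0.5 threshold, the two metrics can move independently of each other.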


### Evaluate on test data

In [21]:
test_X = vectorizer.transform(test_texts).toarray()
test_X = pca.transform(scaler.transform(test_X))
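
Note that the test data goes through `transform` only: the vocabulary, the scaling statistics and the PCA components were all fitted on the training set and are reused here. The idea can be sketched for a single feature in plain Python. The helper names and the toy numbers are assumptions for illustration.

```python
import statistics

def fit_scaler(column):
    """Learn mean and population standard deviation from training values only."""
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std

def apply_scaler(column, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(value - mean) / std for value in column]

train_col = [1.0, 2.0, 3.0, 4.0]
test_col = [2.5, 10.0]

mean, std = fit_scaler(train_col)                  # fit on train only
train_scaled = apply_scaler(train_col, mean, std)
test_scaled = apply_scaler(test_col, mean, std)    # reuse train statistics
print(train_scaled, test_scaled)
```

Fitting the scaler or PCA again on the test set would put the two sets in different coordinate systems and leak test information into preprocessing.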

In [22]:
predict = model.predict(test_X)
proba = model.predict_proba(test_X)
print("ACCURACY = {}".format(accuracy_score(test_y, predict)))
print("ROC-AUC = {}".format(roc_auc_score(test_y, proba[:, 1])))

ACCURACY = 0.8503850385038504
ROC-AUC = 0.8112756086900245


## Kaggle

You can use any models and algorithms that were covered in the lectures or seminars.

#### Example

### Load data

In [23]:
import pandas as pd

In [24]:
data = pd.read_csv('./data/train.tsv', sep = '\t')

In [25]:
data.head()

Unnamed: 0.1,Unnamed: 0,category_id,city,date_created,delivery_available,desc_text,img_num,lat,long,name_text,owner_id,payment_available,price,product_id,product_type,properties,region,sold_mode,subcategory_id,sold_fast
0,1,4,Краснодар,2018-10-08,False,"Продаю стол раскладной, деревянный, советский ...",3,45.0686,38.9518,Стол,4ce583fe8231a0cc4a3c7d241c7d0289,True,500.0,8cb80c05c65c210275f5500779d6b593,1,"[{'slug_id': 'stoly_stulya_tip', 'slug_name': ...",Краснодарский край,1,410,1
1,2,4,Тюмень,2018-06-18,False,"Тарелки глубокие 6 шт. Блюдца, чашки по 6 шт. ...",2,57.184,65.5674,Посуда,e58be2c8f143c17246dc2243b5d3b98f,False,300.0,3b7a9f8b27a53b63525f95bc8070abb2,1,"[{'slug_id': 'dom_dacha_posuda_tip', 'slug_nam...",Тюменская область,1,405,0
2,4,9,Омск,2018-07-31,True,"Новый,с этикеткой. Размер L. Не подошёл по раз...",1,54.9889,73.4312,Костюм,51b408796027214232532b7e478e2159,True,1100.0,c97dd9c5a3e938c52cf5d7822bc0eb7b,1,[{'slug_id': 'zhenskaya_odezhda_pidzhaki_kosty...,Омская область,1,908,0
3,6,3,Санкт-Петербург,2018-04-17,False,"Складывается тростью, все колеса вниз. Сплошна...",4,59.959,30.4877,Коляска,6544b83acbbf04439a7ba983093cafb4,True,5000.0,3e5d0286b25fd7f62f88bc436a59ae4e,1,"[{'slug_id': 'waggon_type', 'slug_name': 'Тип'...",Ленинградская область,1,312,0
4,10,5,Москва,2018-02-09,False,"Неразлучники, птичкам по 1,5 года. Продаю с бо...",2,55.6473,37.4118,Волнистые попугаи,ea575e28daf1f47bfce63015cd3ce5cf,True,2000.0,57b4a8679d0d3eb1e31367b57221098f,1,[],Московская область,1,504,0


### Let's use only lat, long and price to predict the target (your solution must be more sophisticated and use the text information from the dataset)

In [26]:
X = data[['lat', 'long', 'price']].values
y = data['sold_fast'].values

In [27]:
from sklearn.preprocessing import OneHotEncoder

Always shuffle your data, and don't forget to fix random_seed and random_state so that the results are reproducible.

In [28]:
from sklearn.utils import shuffle

In [29]:
X, y = shuffle(X, y, random_state = 42)
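
Two properties matter here: features and targets must be permuted together, and a fixed seed must give the same order on every run. A minimal pure-Python sketch of both follows. The helper `shuffle_together` is hypothetical and the toy rows are made-up values, not taken from the dataset.

```python
import random

def shuffle_together(features, targets, seed):
    """Shuffle two parallel lists with the same seeded permutation."""
    order = list(range(len(features)))
    random.Random(seed).shuffle(order)
    return [features[i] for i in order], [targets[i] for i in order]

# Toy rows in the form (lat, long, price) with a binary target.
features = [
    [55.7, 37.6, 500.0],
    [59.9, 30.3, 300.0],
    [45.0, 38.9, 1100.0],
    [54.9, 73.4, 5000.0],
]
targets = [1, 0, 0, 1]

run_a = shuffle_together(features, targets, seed=42)
run_b = shuffle_together(features, targets, seed=42)

print(run_a == run_b)  # same seed, same order
```

Shuffling `X` and `y` in separate calls without a shared permutation would silently break the row-to-label correspondence.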

In [30]:
model = LogisticRegression()
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Predict on the test data and save the file for submission

In [31]:
data = pd.read_csv('./data/test_nolabel.tsv', sep = '\t')

In [32]:
product_id = data['product_id'].values
X = data[['lat', 'long', 'price']].values
proba = model.predict_proba(X)

In [33]:
data = pd.DataFrame.from_dict({'product_id' : product_id, 'score' : proba[:, 1]})
data.to_csv('./to_submit', sep = ',', index = False)