# Logsztikus regresszió használata
Első lépésként olvassuk be a létrehozott 3 .csv állományt

In [1]:
import pandas as pd

df_train = pd.read_csv("../data/emotions_train.csv", sep=",")
df_train.info()
df_train.head()

df_val = pd.read_csv("../data/emotions_val.csv", sep=",")
df_val.info()
df_val.head()

df_test = pd.read_csv("../data/emotions_test.csv", sep=",")
df_test.info()
df_test.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12875 entries, 0 to 12874
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Text     12875 non-null  object
 1   Emotion  12875 non-null  object
 2   Label    12875 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 301.9+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4292 entries, 0 to 4291
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Text     4292 non-null   object
 1   Emotion  4292 non-null   object
 2   Label    4292 non-null   int64 
dtypes: int64(1), object(2)
memory usage: 100.7+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4292 entries, 0 to 4291
Data columns (total 3 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   Text     4292 non-null   object
 1   Emotion  4292 non-null   object
 2   Label    4292 non-null   int64 
dtypes: int64(1), obj

Unnamed: 0,Text,Emotion,Label
0,i was feeling rather cranky cos i was thinking...,anger,1
1,i came out of the airport that makes me feel i...,anger,1
2,i feel dirty watching this series and you can ...,sadness,0
3,He was so miserabl,sadness,0
4,i feel like a moronic bastard,sadness,0


## Preprocess

Mivel a szövegek alapvetőleg elég jól elő voltak készítve (lowercased) így én egy nagyon leegyszerűsített előkészítést végeztem el rajta. Ez az előlészítés a tokenizálásból illetve e a stopword eltávolításból áll.

In [2]:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string

STOPWORDS = stopwords.words("english")

def tokenize_sentence(sentence):
    return word_tokenize(sentence.lower())

def remove_stopwords(sentence):
    return " ".join([word for word in sentence.split(" ") if not word in STOPWORDS])

def preprocess(sentence):
    return tokenize_sentence(remove_stopwords(sentence))

print(preprocess(df_train["Text"].values[1]))

['feel', 'curious', 'impatient', 'eager', 'confused']


Láthatjuk, hogy néz ki az előkészített szöveg.

In [3]:
df_train["Text"] = df_train["Text"].apply(preprocess)
df_train.head()

Unnamed: 0,Text,Emotion,Label
0,"[feel, uptight, day, complete, hes, around, fe...",fear,4
1,"[feel, curious, impatient, eager, confused]",surprise,3
2,"[transgender, brainwashing, attempt, making, f...",sadness,0
3,"[like, products, organic, feel, assured, added...",happy,5
4,"[feel, bothered, getting, credit, equals, gett...",anger,1


A tokenizált szövegből szótárt képzünk.

In [4]:
def build_vocab(tokenized_input, vocab_size):
    d = dict()

    for tokens in tokenized_input:
        for token in tokens:
            # double check
            if token not in STOPWORDS and token.isalpha():
                d[token] = d.get(token, 0) + 1
    #del d["br"]
    return {k for k, v in sorted(d.items(), key=lambda item: item[1], reverse=True)[:vocab_size]}

In [5]:
VOCAB_SIZE = 10000
VOCAB = build_vocab(df_train["Text"].values, VOCAB_SIZE)
#print(VOCAB)

In [6]:
LABELS = df_train["Emotion"].unique()
LABELS
print(LABELS)

['fear' 'surprise' 'sadness' 'happy' 'anger' 'love']


Előállítjuk a "Freqency table"-t.

In [7]:
def get_frequencies_for_labels(df):
    dict_freqs = {label: {} for label in LABELS}

    for idx in range(df.shape[0]):
        tokens = df.iloc[idx, 0]
        label = df.iloc[idx, 1]
        
        for token in set(tokens):
            if token in VOCAB:
                dict_freqs[label][token] = dict_freqs[label].get(token, 0) + 1
            

    return dict_freqs

In [8]:
frequency_table = get_frequencies_for_labels(df_train)

In [9]:
#print(frequency_table)
frequency_table["anger"]["friend"]

27

In [10]:
import numpy as np

def extract_features(frequency_table, tweet_tokens):
    label_frequencies = {label: 0 for label in LABELS}

    for t in tweet_tokens:
        for label in LABELS:
            label_frequencies[label] += frequency_table[label].get(t, 0)
    
    return pd.Series(label_frequencies)

Ezt követően előkészítjük a logisztikus regressziós modellhez az X és Y értékeket. Az X igazából a tokenizált szövegekből előállított gyakorási táblázatból áll elő míg az Y a címkékből ami az egyes osztályokat jelöli.

In [11]:
X_train_logistic = df_train["Text"].apply(lambda tokens: extract_features(frequency_table, tokens))

Unnamed: 0,fear,surprise,sadness,happy,anger,love
0,2288,766,5762,6719,2654,1511
1,981,381,2523,2989,1199,672
2,1241,417,3220,3684,1475,840
3,1182,408,3261,3873,1533,927
4,983,329,2604,3053,1277,704
...,...,...,...,...,...,...
12870,1039,351,2700,3254,1246,724
12871,952,322,2519,3004,1159,673
12872,740,255,1604,1784,760,397
12873,973,301,1953,1911,977,450


In [12]:
y_train_logistic = df_train["Emotion"].values
y_train_logistic

array(['fear', 'surprise', 'sadness', ..., 'happy', 'anger', 'love'],
      dtype=object)

A logisztikus regressziós modell használatával a ```.fit()``` két paramétereként átadjuk a korábban előállított  ```X_train_logistic``` és ```y_train_logistic```-t és elkezdődhet a tanítattás.

In [20]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0, max_iter=12000).fit(X_train_logistic, y_train_logistic)

A továbbiakban pedig különböző értékek vizsgálatával ellenőrizhetjük a modell sikerességét.

In [21]:
from sklearn.metrics import accuracy_score

preds_train = clf.predict(X_train_logistic)

print("Train accuracy:", accuracy_score(y_train_logistic, preds_train))
print(preds_train)

Train accuracy: 0.718757281553398
['fear' 'surprise' 'sadness' ... 'happy' 'anger' 'love']


In [22]:
X_val = df_val["Text"].apply(preprocess)
X_val_logistic = df_val["Text"].apply(lambda tokens: extract_features(frequency_table, tokens))
y_val_logistic = df_val["Emotion"].values
X_val_logistic

Unnamed: 0,fear,surprise,sadness,happy,anger,love
0,65,40,127,188,51,47
1,29,17,71,103,24,23
2,92,67,319,379,144,77
3,36,24,78,106,34,19
4,17,14,51,61,24,12
...,...,...,...,...,...,...
4287,33,19,71,103,35,24
4288,25,19,54,72,15,18
4289,121,83,274,333,94,70
4290,39,26,87,125,18,29


In [23]:
preds_val = clf.predict(X_val_logistic)

print("Validation accuracy:", accuracy_score(y_val_logistic, preds_val))

Validation accuracy: 0.30242311276794037


In [24]:
X_test = df_test["Text"].apply(preprocess)
X_test_logistic = df_test["Text"].apply(lambda tokens: extract_features(frequency_table, tokens))
y_test_logistic = df_test["Emotion"].values
X_test_logistic

Unnamed: 0,fear,surprise,sadness,happy,anger,love
0,29,19,81,115,44,19
1,38,28,106,123,47,22
2,46,34,111,146,38,32
3,4,4,17,22,11,3
4,11,9,32,38,16,4
...,...,...,...,...,...,...
4287,23,11,43,44,12,12
4288,42,37,136,164,55,34
4289,85,48,188,246,73,55
4290,44,34,123,141,66,26


In [25]:
preds_test = clf.predict(X_test_logistic)

print("Test accuracy:", accuracy_score(y_test_logistic, preds_test))

Test accuracy: 0.2982292637465051
