> <h3 style="color: red;"><b>IMPORTANT</b></h3>
> 
> For the third time: this project is best run in a dedicated, isolated environment, so that you do not have to install extra dependencies.

## 0. Table of contents
<details>
<summary>Expand table of contents</summary>

1. Task description
2. Importing dependencies
3. Preparing the tooling
4. EDA
5. Encoding and evaluation
6. Training
7. Evaluating the model
</details>

## 1. Task description

`Task: text classification`

The goal is to classify tweets about the 2022 FIFA World Cup. There are 3 classes: positive, negative, and neutral.

<details>
<summary>Data</summary>

- `Date Created`: date
- `Number of Likes`: number of likes
- `Source of Tweet`: source of the tweet
- `Tweet`: the tweet text
- `Sentiment`: class label
</details>

**What needs to be done:**

1. EDA and data processing (plots, statistics, data cleaning, and some conclusions about your data).
2. Classic ML approaches (at least 3), e.g. logistic regression, SVM.
3. DL approaches (at least 2): FC, RNN, 1D/2D CNN, LSTM. Pretrained models are allowed; transformers are not allowed for now.

## 2. Importing dependencies


In [None]:
import re
import string
from nltk.stem import PorterStemmer
import nltk
import pandas as pd
import numpy as np
import xgboost as xgb
from gensim.models import Word2Vec, Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import f1_score, make_scorer
from sklearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings('ignore')

In [None]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [None]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


True

In [None]:
stop_words = nltk.corpus.stopwords.words("english")

## 3. Preparing the tooling


In [None]:
def lower(text):
    return text.lower()


def tokenization(text):
    text = re.split('\\W+', text)
    return text


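def remove_punctuation(text):
    # string.punctuation is a str, so `in` is a substring test: this drops
    # single punctuation characters and also empty strings.
    return [word for word in text if word not in string.punctuation]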


def remove_stopwords(text):
    text = [word for word in text if word not in stop_words]
    return text


def stemming(text):
    ps = PorterStemmer()
    text = [ps.stem(word) for word in text]
    return text


def lemmatization(text):
    lem = nltk.stem.WordNetLemmatizer()
    text = [lem.lemmatize(word) for word in text]
    return text


def remove_numbers(text):
    text = [word for word in text if not word.isnumeric()]
    return text


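def remove_short_words(text, min_len=3):
    # Strict comparison: words of exactly min_len characters are dropped too.
    text = [word for word in text if len(word) > min_len]
    return text


def remove_long_words(text, max_len=15):
    # Strict comparison: words of exactly max_len characters are dropped too.
    text = [word for word in text if len(word) < max_len]
    return text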


def remove_empty(text):
    text = [word for word in text if word != '']
    return text


def remove_space(text):
    text = [word.strip() for word in text]
    return text
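
To see how the first few helpers interact, here is a minimal standalone sketch that uses only the standard library. The sample tweet is made up for illustration, and the steps are inlined instead of calling the functions above, so treat it as an approximation of the cell rather than a replacement for it.

```python
import re
import string

# Hypothetical sample tweet, not taken from the dataset.
sample = "GOAL!!! What a match, 2-1 to #Argentina :)"

# Same steps as lower() + tokenization(): split on runs of non-word characters.
tokens = re.split(r"\W+", sample.lower())
print(tokens)

# Same test as remove_punctuation(). string.punctuation is a str, so `in` is a
# substring check, and the empty string is a substring of every string.
no_punct = [word for word in tokens if word not in string.punctuation]
print(no_punct)

# Same test as remove_short_words() with the default min_len=3.
long_enough = [word for word in no_punct if len(word) > 3]
print(long_enough)
```

The split leaves an empty string at the end because the tweet ends with non-word characters, and the punctuation filter happens to remove it. The length filter also drops every word of three characters or fewer, since the comparison is strict.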

In [None]:
class Preprocess(BaseEstimator, TransformerMixin):
    def __init__(self, lower=True, tokenization=True, remove_punctuation=True, remove_stopwords=True, stemming=True, lemmatization=True, remove_numbers=True, remove_short_words=True, remove_long_words=True, remove_empty=True, remove_space=True, remove_specific_words=False):
        self.lower = lower
        self.tokenization = tokenization
        self.remove_punctuation = remove_punctuation
        self.remove_stopwords = remove_stopwords
        self.stemming = stemming
        self.lemmatization = lemmatization
        self.remove_numbers = remove_numbers
        self.remove_short_words = remove_short_words
        self.remove_long_words = remove_long_words
        self.remove_empty = remove_empty
        self.remove_space = remove_space
        self.remove_specific_words = remove_specific_words

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.lower:
            X = X.apply(lambda x: lower(x))
        if self.tokenization:
            X = X.apply(lambda x: tokenization(x))
        if self.remove_punctuation:
            X = X.apply(lambda x: remove_punctuation(x))
        if self.remove_stopwords:
            X = X.apply(lambda x: remove_stopwords(x))
        if self.stemming:
            X = X.apply(lambda x: stemming(x))
        if self.lemmatization:
            X = X.apply(lambda x: lemmatization(x))
        if self.remove_numbers:
            X = X.apply(lambda x: remove_numbers(x))
        if self.remove_short_words:
            X = X.apply(lambda x: remove_short_words(x))
        if self.remove_long_words:
            X = X.apply(lambda x: remove_long_words(x))
        if self.remove_empty:
            X = X.apply(lambda x: remove_empty(x))
        if self.remove_space:
            X = X.apply(lambda x: remove_space(x))

        return X

In [None]:
pipe = Pipeline([
    ('preprocess', Preprocess())
])

`The Preprocess transformer lets us run the text cleaning steps as an automated pipeline.`
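
The flag-per-step idea can be sketched without scikit-learn or NLTK. `MiniPreprocess` below is a hypothetical, three-step stand-in that works on plain lists; it only illustrates how each boolean flag switches one step on or off and is not the class used in this notebook.

```python
import re


def lower(text):
    return text.lower()


def tokenization(text):
    return re.split(r"\W+", text)


def remove_empty(tokens):
    return [token for token in tokens if token != ""]


class MiniPreprocess:
    """Hypothetical three-step stand-in for Preprocess (plain lists, no scikit-learn)."""

    # Steps run in this fixed order; each one is guarded by the flag of the same name.
    steps = (("lower", lower), ("tokenization", tokenization), ("remove_empty", remove_empty))

    def __init__(self, lower=True, tokenization=True, remove_empty=True):
        self.lower = lower
        self.tokenization = tokenization
        self.remove_empty = remove_empty

    def fit(self, X, y=None):
        # Nothing is learned from the data, so fit only returns self.
        return self

    def transform(self, X):
        for name, func in self.steps:
            if getattr(self, name):
                X = [func(item) for item in X]
        return X


docs = ["Hello, World!"]
cleaned = MiniPreprocess().fit(docs).transform(docs)
kept_empty = MiniPreprocess(remove_empty=False).fit(docs).transform(docs)
print(cleaned)
print(kept_empty)
```

Because every step is stateless, `fit` has nothing to learn, which is why the same fitted object can later be reused on the test tweets with `transform` alone.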

## 4. EDA


In [None]:
df = pd.read_csv('train_data.csv', index_col=0)
df_submission = pd.read_csv('test_data.csv', index_col=0)

In [None]:
X = df['Tweet'].copy()

In [None]:
clean_text = pipe.fit_transform(X)

In [None]:
clean_text_str = clean_text.apply(lambda x: ' '.join(x))

In [None]:
df['text'] = clean_text_str

In [None]:
df['class_text'] = pd.Categorical(df['Sentiment'])
df['class'] = df['class_text'].cat.codes

In [None]:
df['class_text'].value_counts()

positive    7539
neutral     7372
negative    5089
Name: class_text, dtype: int64
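
The counts above can be turned into class shares and a majority-class baseline with a few lines of plain Python. This is a small sketch that hard-codes the three counts from the output; the variable names are illustrative.

```python
# Class counts copied from the value_counts() output above.
counts = {"positive": 7539, "neutral": 7372, "negative": 5089}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the most frequent class.
majority_label = max(counts, key=counts.get)
majority_baseline = shares[majority_label]

for label, share in shares.items():
    print(f"{label:<9} {share:.3f}")
print(f"majority baseline ({majority_label}): {majority_baseline:.3f}")
```

The classes are moderately imbalanced, with `negative` the smallest, and any model worth keeping should score clearly above the majority baseline.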

In [None]:
models = {
    'logistic_regression': LogisticRegression(),
    'decision_tree': DecisionTreeClassifier(),
    'random_forest': RandomForestClassifier(),
    'xgboost': xgb.XGBClassifier(),
}

In [None]:
X = df['text']
y = df['class']

## 5. Encoding and evaluation


`For convenience, the text labels are converted to integer codes; the best model is then chosen by cross-validation, based on the resulting metrics.`
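
The label encoding can be mirrored in plain Python. When no explicit category list is given, `pd.Categorical` sorts the distinct values, and `.cat.codes` are positions in that sorted list; the sketch below reproduces that behaviour on a handful of made-up labels (the `labels` list is an assumption, not data from the notebook).

```python
# Hypothetical labels standing in for the Sentiment column.
labels = ["positive", "neutral", "negative", "neutral", "positive"]

# Sorted distinct values play the role of the categorical's categories.
categories = sorted(set(labels))
code_of = {label: code for code, label in enumerate(categories)}

# Integer code of each label, analogous to .cat.codes.
codes = [code_of[label] for label in labels]

print(categories)
print(codes)
```

With the three sentiment classes this yields negative = 0, neutral = 1 and positive = 2.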


In [None]:
# Custom transformer for Word2Vec


class Word2VecVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None
        self.size = 100

    def fit(self, X, y=None):
        sentences = [row.split() for row in X]
        self.model = Word2Vec(sentences, vector_size=self.size, window=5, min_count=1, workers=4)
        return self

    def transform(self, X):
        return np.array([
            np.mean([self.model.wv[word] for word in words if word in self.model.wv]
                    or [np.zeros(self.size)], axis=0)
            for words in [row.split() for row in X]
        ])

# Custom transformer for Doc2Vec


class Doc2VecVectorizer(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.model = None
        self.size = 100

    def fit(self, X, y=None):
        tagged_data = [TaggedDocument(words=row.split(), tags=[str(i)]) for i, row in enumerate(X)]
        self.model = Doc2Vec(tagged_data, vector_size=self.size, window=2, min_count=1, workers=4, epochs=20)
        return self

    def transform(self, X):
        return np.array([self.model.infer_vector(row.split()) for row in X])

# Instantiate the vectorizers (Word2Vec and Doc2Vec are created here
# but are not included in the comparison below)


tfidf_vectorizer = TfidfVectorizer()
count_vectorizer = CountVectorizer()
word2vec_vectorizer = Word2VecVectorizer()
doc2vec_vectorizer = Doc2VecVectorizer()

f1_micro_scorer = make_scorer(f1_score, average='micro')

vectorizers = {
    'tfidf': tfidf_vectorizer,
    'count': count_vectorizer
}

results = {}

for model_name, model in models.items():
    for vec_name, vectorizer in vectorizers.items():
        # Vectorizer and model are wrapped in one pipeline so that the
        # vectorizer is refit on the training part of every fold.
        pipeline = make_pipeline(vectorizer, model)
        scores = cross_val_score(pipeline, X, y, cv=5, scoring=f1_micro_scorer)
        avg_score = np.mean(scores)
        print(f"Model: {model_name} | Vectorizer: {vec_name} | F1 Micro Cross-Validation Scores: {scores}")
        print(f"Average F1 Micro Score: {avg_score}")

        results[(model_name, vec_name)] = avg_score

Model: logistic_regression | Vectorizer: tfidf | F1 Micro Cross-Validation Scores: [0.7035  0.6905  0.70525 0.68425 0.70525]
Average F1 Micro Score: 0.69775
Model: logistic_regression | Vectorizer: count | F1 Micro Cross-Validation Scores: [0.68925 0.68175 0.6975  0.6855  0.7    ]
Average F1 Micro Score: 0.6908000000000001
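
With exactly one label per tweet, micro-averaged F1 coincides with plain accuracy, so the scores above can be read as the share of correctly classified tweets. The sketch below checks this on a small made-up example; `micro_f1` is a hand-written helper for illustration, not the scikit-learn implementation.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    tp = fp = fn = 0
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1  # predicted this class, but the truth is another one
            elif truth == label:
                fn += 1  # missed a sample of this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical class codes (0 = negative, 1 = neutral, 2 = positive).
y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 1, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(micro_f1(y_true, y_pred), accuracy)
```

Every misclassified tweet counts once as a false positive for the predicted class and once as a false negative for the true class, which is why pooled precision and recall are both equal to accuracy here.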


## 6. Training


`Several candidate models were defined for this run; here we pick the best one from the cross-validation results and train it.`

In [None]:
best_model, best_vec = max(results, key=results.get)
best_score = results[(best_model, best_vec)]

print(f"Best Model: {best_model} | Vectorizer: {best_vec} | F1 Micro Score: {best_score}")

Best Model: logistic_regression | Vectorizer: tfidf | F1 Micro Score: 0.69775
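
For reference, the selection step can be reproduced on its own. The sketch below hard-codes the two average scores reported in the output above and, in addition to picking the winner, prints a ranking of all combinations.

```python
# Average scores copied from the cross-validation output above.
results = {
    ("logistic_regression", "tfidf"): 0.69775,
    ("logistic_regression", "count"): 0.6908000000000001,
}

# max() iterates over the dict keys (the tuples) and compares their scores.
best_model, best_vec = max(results, key=results.get)

# Full ranking, best first.
ranking = sorted(results.items(), key=lambda item: item[1], reverse=True)
for (model_name, vec_name), score in ranking:
    print(f"{model_name:<20} {vec_name:<6} {score:.4f}")
```

The gap between the two vectorizers is under one percentage point, which is smaller than the spread between individual folds, so the ranking should be read with some caution.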


In [None]:
df_submission = pd.read_csv('/content/test_data.csv', index_col=0)

`For possible further analysis, we run the test data through the best candidate (model) and write the predictions to a file.`


In [None]:
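# X already holds the cleaned text (df['text']), so it is not passed through
# pipe again: a second pass is redundant and can alter already-stemmed tokens,
# which would make the training text differ from the once-cleaned test tweets.
X_clean_str = X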

best_vectorizer = vectorizers[best_vec].fit(X_clean_str)
X_vectorized = best_vectorizer.transform(X_clean_str)

best_model_instance = models[best_model].fit(X_vectorized, y)

X_submission = df_submission['Tweet'].copy()
X_submission_clean = pipe.transform(X_submission)
X_submission_clean_str = X_submission_clean.apply(lambda x: ' '.join(x))

X_submission_vectorized = best_vectorizer.transform(X_submission_clean_str)

predictions = best_model_instance.predict(X_submission_vectorized)

label_mapping = {0: 'negative', 1: 'neutral', 2: 'positive'}
predicted_labels = [label_mapping[label] for label in predictions]

submission_df = pd.DataFrame({
    'ID': df_submission.index,
    'label': predicted_labels
})

submission_df.to_csv('/content/submission.csv', index=False)
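
The decoding and export step can be sanity-checked in isolation. The sketch below uses the same `label_mapping` and column names as the cell above, but the `predictions` and `ids` values are made up, and the file is written to an in-memory buffer instead of `/content/submission.csv`.

```python
import csv
import io

label_mapping = {0: "negative", 1: "neutral", 2: "positive"}

# Hypothetical model output and row IDs, standing in for `predictions`
# and `df_submission.index`.
predictions = [2, 0, 1, 2]
ids = [101, 102, 103, 104]

# Fail early if the model produced a code that has no label.
unknown = set(predictions) - set(label_mapping)
if unknown:
    raise ValueError(f"Unexpected class codes: {sorted(unknown)}")

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["ID", "label"])
for row_id, code in zip(ids, predictions):
    writer.writerow([row_id, label_mapping[code]])

print(buffer.getvalue())
```

Checking the codes before writing guards against a mismatch between the hard-coded mapping and the codes the training labels were actually given.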