# Working with text features

We can try CountVectorizer with several models to predict a film's box-office revenue from its description, and then use that prediction as one more feature for the main model. The result is a simple ensemble that may give somewhat better results.
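
As a rough illustration of that idea, the sketch below feeds a text model's prediction into a second model as an extra column. Everything in it is an assumption made for the example: the toy descriptions, the `budget` feature, the `revenue_class` labels, and the choice of `LogisticRegression` for both stages. It uses out-of-fold predictions via `cross_val_predict` so that the new feature is not computed from labels the text model has already seen, which is one common way to build such an ensemble, not necessarily the one used for the main model here.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical): descriptions, one numeric feature, and a class label.
descriptions = [
    "space battle alien invasion", "alien ship space war",
    "romantic comedy wedding", "wedding love comedy",
    "space alien war hero", "love story comedy romance",
    "alien planet battle", "comedy wedding romance love",
]
budget = np.array([9.0, 8.5, 2.0, 1.5, 9.5, 2.5, 8.0, 1.0])
revenue_class = np.array([1, 1, 0, 0, 1, 0, 1, 0])

# Text model: bag of words followed by a linear classifier.
text_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=2000))

# Out-of-fold predictions: each row is predicted by a model that never saw it,
# so the new feature does not leak the training labels.
text_pred = cross_val_predict(text_model, descriptions, revenue_class, cv=4)

# Append the text prediction as one more column for the main model.
main_features = np.column_stack([budget, text_pred])
main_model = LogisticRegression(max_iter=2000).fit(main_features, revenue_class)
```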

In [1]:
import pandas as pd
import re
import os
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
import bisect
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.feature_extraction.text import CountVectorizer
import pickle

In [2]:
df = pd.read_csv(os.path.join('data', 'preprocessed_train.csv'))

Preprocess the film descriptions: strip punctuation, lowercase, remove stop words, and apply a stemmer

In [3]:
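nltk.download('stopwords')

# Build these once rather than on every call to preprocess_text.
STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess_text(text: str) -> str:
    """Strip punctuation, lowercase, drop English stop words, and stem each token."""
    # [^\w\s] matches anything that is neither a word character nor whitespace.
    tokens: list[str] = re.sub(r'[^\w\s]+', '', str(text)).lower().split()
    return ' '.join(STEMMER.stem(w) for w in tokens if w not in STOP_WORDS)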

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
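
The punctuation-stripping step in `preprocess_text` relies on a negated character class, where the position of `^` matters. The small sketch below uses only the standard `re` module and a made-up sample string to show the difference between a class with a second `^` inside it and one without.

```python
import re

sample = "Hello, World! 2^10 = 1024 (really?)"

# A '^' only negates a character class when it is the first character;
# anywhere else it is a literal caret, so this pattern keeps '^' in the text.
keeps_caret = re.sub(r'[^\w^\s]+', '', sample)

# Without the second '^', every non-word, non-space character is removed.
drops_caret = re.sub(r'[^\w\s]+', '', sample)

print(keeps_caret)
print(drops_caret)
```

For descriptions that contain no caret the two patterns behave identically, so the choice only matters for unusual input.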


In [4]:
df['description'] = df['description'].apply(preprocess_text)
df = df.drop(columns=['Unnamed: 0'])

Split the dataset into train and test sets and build the model inputs

In [5]:
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

In [6]:
bow = CountVectorizer()
x_train = bow.fit_transform(train_df['description'])
y_train = train_df['revenue']
x_test = bow.transform(test_df['description'])
y_test = test_df['revenue']
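
The vectorizer is fitted on the training descriptions only and then reused for the test descriptions. The toy example below (the documents and the `_demo` names are made up for illustration) shows what that means in practice: the vocabulary comes from the training texts, and a word seen only at test time is silently ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["space alien battl", "love stori wed", "alien love"]
test_docs = ["alien wed dragon"]  # 'dragon' never appears in train_docs

bow_demo = CountVectorizer()
x_train_demo = bow_demo.fit_transform(train_docs)  # learns the vocabulary
x_test_demo = bow_demo.transform(test_docs)        # reuses it unchanged

print(sorted(bow_demo.vocabulary_))
print(x_test_demo.toarray())
```

Both matrices have one column per vocabulary word, which is what lets a model trained on one be applied to the other.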

Try a range of models

In [7]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [8]:
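def get_best_model(model_list: list):
    """Fit each model on the train split and return the one with the best weighted F1 on the test split.

    Relies on the notebook-level x_train, y_train, x_test and y_test.
    """
    best_model = None
    best_score = float('-inf')  # so a model is returned even if every score is 0
    for model in model_list:
        model.fit(x_train, y_train)
        y_pred = model.predict(x_test)
        score = f1_score(y_true=y_test, y_pred=y_pred, average='weighted')
        print(f"{type(model).__name__} F1 Score is: {score:.4f}")
        if score > best_score:
            best_score = score
            best_model = model
    return best_model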
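
For reference, `average='weighted'` means the per-class F1 scores are averaged with weights proportional to how often each class occurs in `y_true`. The sketch below checks this on a small made-up label vector; the arrays are illustrative and unrelated to the film data.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for three classes with unequal support.
y_true = np.array([0, 0, 0, 0, 1, 1, 2])
y_pred = np.array([0, 0, 1, 0, 1, 2, 2])

# One F1 score per class, in label order 0, 1, 2.
per_class = f1_score(y_true, y_pred, average=None)

# Support: the number of true samples of each class.
support = np.bincount(y_true)

# Support-weighted mean of the per-class scores.
manual = np.sum(per_class * support) / support.sum()
weighted = f1_score(y_true, y_pred, average='weighted')

print(manual, weighted)
```

Frequent classes therefore dominate the score, which is worth keeping in mind if the revenue classes are imbalanced.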

In [9]:
model_list = [LogisticRegression(max_iter=2000), SVC(kernel='linear'), SVC(kernel='poly'), SVC(kernel='rbf'), MultinomialNB(), DecisionTreeClassifier(), RandomForestClassifier()]
model_list += [KNeighborsClassifier(n_neighbors=x) for x in range(1, 16, 2)]
model = get_best_model(model_list)

LogisticRegression F1 Score is: 0.2371
SVC F1 Score is: 0.2432
SVC F1 Score is: 0.1073
SVC F1 Score is: 0.1502
MultinomialNB F1 Score is: 0.2389
DecisionTreeClassifier F1 Score is: 0.1904
RandomForestClassifier F1 Score is: 0.1688
KNeighborsClassifier F1 Score is: 0.0560
KNeighborsClassifier F1 Score is: 0.0260
KNeighborsClassifier F1 Score is: 0.0959
KNeighborsClassifier F1 Score is: 0.1060
KNeighborsClassifier F1 Score is: 0.0807
KNeighborsClassifier F1 Score is: 0.0818
KNeighborsClassifier F1 Score is: 0.0601
KNeighborsClassifier F1 Score is: 0.0644


The SVC with a linear kernel gave the best result (weighted F1 of 0.2432)
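
Since the scores above are measured on the same test split that picks the winner, they are likely to be somewhat optimistic. One possible alternative, sketched below on synthetic data, is to compare candidates with cross-validation on the training split and touch the test split only once at the end. The `make_classification` data, the two candidates, and the names `candidates`, `cv_scores` and `best_name` are assumptions made for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the bag-of-words matrix and labels (assumption).
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

candidates = {
    "logreg": LogisticRegression(max_iter=2000),
    "svc_linear": SVC(kernel="linear"),
}

# Score each candidate with 5-fold cross-validation on the training split only.
cv_scores = {
    name: cross_val_score(est, X_train, y_train, cv=5, scoring="f1_weighted").mean()
    for name, est in candidates.items()
}
best_name = max(cv_scores, key=cv_scores.get)

# Refit the winner on the full training split; the test split is used once.
best = candidates[best_name].fit(X_train, y_train)
test_accuracy = best.score(X_test, y_test)  # accuracy as a final sanity check
print(best_name, cv_scores[best_name], test_accuracy)
```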

In [10]:
with open(os.path.join('data', 'text_model.pkl'), 'wb') as f:
    pickle.dump((model, bow), f)
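
Saving the model and the vectorizer as one tuple keeps them in sync: the model only understands columns produced by that exact fitted vocabulary. The sketch below shows a round trip on a tiny stand-in pair; the toy documents, the temporary path and the `_demo` and `loaded_` names are assumptions for illustration. New descriptions would also need to go through the same `preprocess_text` step before being vectorized.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import SVC

# Tiny stand-in for the trained pair (hypothetical data, already stemmed).
docs = ["space alien battl", "alien war space", "love wed comedi", "comedi love stori"]
labels = [1, 1, 0, 0]
bow_demo = CountVectorizer()
model_demo = SVC(kernel='linear').fit(bow_demo.fit_transform(docs), labels)

# Save both objects together, mirroring the cell above but in a temp directory.
path = os.path.join(tempfile.mkdtemp(), 'text_model.pkl')
with open(path, 'wb') as f:
    pickle.dump((model_demo, bow_demo), f)

# Later, for example when building features for the main model: restore both.
with open(path, 'rb') as f:
    loaded_model, loaded_bow = pickle.load(f)

# Vectorize with the restored vocabulary, then predict the text feature.
new_docs = ["alien space war", "love comedi"]
text_feature = loaded_model.predict(loaded_bow.transform(new_docs))
print(text_feature)
```

Pickle files should only be loaded from trusted sources, and ideally with the same scikit-learn version that produced them.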