## Model experiments

### Importing dependencies

**Loading libraries**

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

In [3]:
import numpy as np

In [4]:
from sklearn.naive_bayes import GaussianNB

**Loading data**

In [5]:
df = pd.read_parquet('elon_musk_tweets_labeled.parquet')  # Produced by the preprocessing notebook
X = df.text  # TODO: a vector representation still has to be extracted from this
y = df.feeling_auto  # Where did df.lemmatized go??

### Simple rules

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
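# NOTE: `high` is exclusive, so randint(low=0, high=1) always returns 0 (a constant prediction);
# a uniform guess over 0/1 labels would need high=2.
y_pred = np.random.randint(low=0, high=1, size=y_test.shape[0])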
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 2952 points : 1833
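The baseline above depends on how `np.random.randint` treats its bounds: `high` is exclusive. The sketch below contrasts the two calls on a hypothetical array size, assuming the labels are encoded as 0/1 (the `size` value and the seeded generator are illustrative choices, not part of the notebook).

```python
import numpy as np

n = 1000  # hypothetical test-set size, for illustration only

# Legacy API: `high` is exclusive, so the only integer in [0, 1) is 0.
always_zero = np.random.randint(low=0, high=1, size=n)

# Uniform guess over {0, 1}: the exclusive upper bound must be 2.
rng = np.random.default_rng(0)  # seeded so the baseline is reproducible
coin_flip = rng.integers(low=0, high=2, size=n)

print(np.unique(always_zero))  # [0]
print(np.unique(coin_flip))    # [0 1]
```

A constant prediction and a coin flip are both useful reference points, but they answer different questions: the first measures class imbalance, the second measures chance-level performance.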


Dictionary-based approach:
- https://datascience.stackexchange.com/questions/120870/measuring-sentiment-using-a-dictionary-based-model
- https://arxiv.org/pdf/2311.06221
- https://stackoverflow.com/questions/4188706/sentiment-analysis-dictionaries
- https://r4thehumanities.home.blog/dictionary-based-approaches-including-sentiment-analysis/
- https://bookdown.org/f_lennert/text-mining-book/sentimentanalysis.html
- https://ceur-ws.org/Vol-1743/paper9.pdf
- https://link.springer.com/article/10.1007/s11135-024-01896-9
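The links above describe lexicon-based sentiment scoring. A minimal sketch of the idea could look as follows; the tiny `LEXICON`, the tokenization rule, and the 0/1 label mapping are all hypothetical and would be replaced by a published sentiment dictionary and the notebook's own preprocessing.

```python
import re

# Hypothetical toy lexicon; a real one would come from a published resource.
LEXICON = {"great": 1, "love": 1, "good": 1, "bad": -1, "terrible": -1, "hate": -1}


def lexicon_score(text: str) -> int:
    """Sum the polarity of every lexicon word found in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(tok, 0) for tok in tokens)


def lexicon_label(text: str) -> int:
    """Map the score to a binary label: 1 for positive, 0 otherwise."""
    return int(lexicon_score(text) > 0)


examples = ["I love this great rocket", "Terrible launch, bad weather"]
print([lexicon_label(t) for t in examples])  # [1, 0]
```

Such a rule needs no training data, which makes it a natural member of the "simple rules" family, though its quality depends entirely on how well the lexicon covers the corpus.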

### Simple models, BoW

**Preprocessing**

In [7]:
tfidf_vectorizer = TfidfVectorizer(max_features=100)
X = tfidf_vectorizer.fit_transform(X).toarray()
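The cell above fits the vectorizer on the whole corpus before any split, so IDF statistics from the future test rows leak into the features. One possible alternative, sketched here on a hypothetical toy corpus standing in for `df.text` and `df.feeling_auto`, is to split first and fit the vectorizer on the training texts only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus; the real notebook would use df.text and df.feeling_auto.
texts = ["good launch", "great rocket", "love this car", "nice work team",
         "bad delay", "terrible crash", "hate this bug", "awful traffic jam"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, random_state=0, stratify=labels
)

vectorizer = TfidfVectorizer(max_features=100)
X_train = vectorizer.fit_transform(texts_train).toarray()  # vocabulary and IDF learned here
X_test = vectorizer.transform(texts_test).toarray()        # reuses the training vocabulary

print(X_train.shape, X_test.shape)
```

Both matrices share the same columns, so any classifier fitted on `X_train` can score `X_test` directly.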

**BoW + Naive Bayes**

[Description](https://scikit-learn.org/1.5/modules/naive_bayes.html)

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
gnb = GaussianNB()
y_pred = gnb.fit(X_train, y_train).predict(X_test)
print("Number of mislabeled points out of a total %d points : %d"
      % (X_test.shape[0], (y_test != y_pred).sum()))

Number of mislabeled points out of a total 4133 points : 1512
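The two recorded runs use different test-set sizes (2952 and 4133 points), so the raw mislabeled counts are not directly comparable. Converting them to rates, as in this small worked example that uses only the numbers printed above, makes the comparison explicit.

```python
# (mislabeled, total) pairs taken from the two recorded outputs.
runs = {
    "simple-rule baseline": (1833, 2952),
    "BoW + Naive Bayes": (1512, 4133),
}

error_rates = {name: wrong / total for name, (wrong, total) in runs.items()}

for name, rate in error_rates.items():
    print(f"{name}: error rate {rate:.3f}, accuracy {1 - rate:.3f}")
# simple-rule baseline: error rate 0.621, accuracy 0.379
# BoW + Naive Bayes: error rate 0.366, accuracy 0.634
```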


**BoW + $k$ nearest neighbors**

[Description](https://scikit-learn.org/1.5/modules/neighbors.html)
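This subsection has no cell yet. A minimal sketch of what it could contain is shown below, assuming a TF-IDF representation like the one used earlier; the toy corpus, `n_neighbors=3`, and the cosine metric are illustrative choices rather than settings taken from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real notebook would use df.text and df.feeling_auto.
texts = ["good launch", "great rocket", "love this car", "nice work team",
         "bad delay", "terrible crash", "hate this bug", "awful traffic jam"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Cosine distance is a common choice for sparse text vectors (an assumption here).
knn = make_pipeline(
    TfidfVectorizer(max_features=100),
    KNeighborsClassifier(n_neighbors=3, metric="cosine"),
)
knn.fit(texts, labels)

pred = knn.predict(["great launch", "terrible bug"])
print(pred)
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary tied to the training texts whenever the pipeline is refitted.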

**BoW + Logistic regression**

[Description](https://scikit-learn.org/1.5/modules/linear_model.html#logistic-regression)
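This subsection is also still empty. As a starting point, a logistic regression over TF-IDF features might be sketched as follows; the toy corpus and `max_iter=1000` are assumptions made for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the real notebook would use df.text and df.feeling_auto.
texts = ["good launch", "great rocket", "love this car", "nice work team",
         "bad delay", "terrible crash", "hate this bug", "awful traffic jam"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

logreg = make_pipeline(
    TfidfVectorizer(max_features=100),
    LogisticRegression(max_iter=1000),
)
logreg.fit(texts, labels)

# Probability of the positive class (column 1), usable with roc_auc_score.
proba = logreg.predict_proba(["great launch"])[:, 1]
print(proba)
```

Unlike `predict`, `predict_proba` returns a continuous score, which is what a ranking metric such as ROC AUC needs.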

### Tree-based models

In [9]:
from sklearn.metrics import roc_auc_score

In [24]:
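def pr_diff(clf):
    """Return the gap between the best train ROC AUC and the worst test ROC AUC."""
    train_scores = []
    test_scores = []

    # NOTE: random_state is fixed at 0, so all four iterations use the same split;
    # passing random_state=i instead would evaluate four different splits.
    for i in range(4):
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0)
        clf.fit(X_train, y_train)

        train_score = roc_auc_score(y_train, clf.predict_proba(X_train)[:,1])
        test_score = roc_auc_score(y_test, clf.predict_proba(X_test)[:,1])

        train_scores.append(train_score)
        test_scores.append(test_score)

    return np.abs(np.max(train_scores) - np.min(test_scores))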
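Because `train_test_split` is deterministic for a given `random_state`, the seed has to change between iterations for the loop to sample different splits. A self-contained variant might look like the sketch below; the name `auc_gap`, the synthetic data from `make_classification`, and the decision-tree classifier are hypothetical stand-ins for the notebook's `X`, `y`, and CatBoost model.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def auc_gap(clf, X, y, n_splits=4, test_size=0.7):
    """Best train ROC AUC minus worst test ROC AUC over several different splits."""
    train_scores, test_scores = [], []
    for seed in range(n_splits):
        # A different seed per iteration gives a different split each time.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed, stratify=y
        )
        clf.fit(X_tr, y_tr)
        train_scores.append(roc_auc_score(y_tr, clf.predict_proba(X_tr)[:, 1]))
        test_scores.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    return abs(max(train_scores) - min(test_scores))


# Synthetic binary data, for illustration only.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
gap = auc_gap(DecisionTreeClassifier(max_depth=3, random_state=0), X_demo, y_demo)
print(f"AUC gap: {gap:.3f}")
```

Taking the maximum train score against the minimum test score gives a pessimistic estimate of overfitting, which suits its role as a quantity to minimize.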

In [25]:
from sklearn.model_selection import cross_val_score

In [27]:
import optuna, catboost

In [45]:
(pd.DataFrame(X).dtypes == np.float64).all()

True

In [53]:
def objective(trial):
    params = {
        'num_trees': trial.suggest_int('num_trees', 5, 100),
        'max_depth': trial.suggest_int('max_depth', 1, 10),
        'min_data_in_leaf': trial.suggest_int('min_data_in_leaf', 1, 100),
        'verbose': False
    }

    model = catboost.CatBoostClassifier(**params)
    metrics = [np.mean(cross_val_score(model, X=pd.DataFrame(X), y=y, scoring='roc_auc', cv=10)), pr_diff(model)]

    return metrics

In [54]:
study = optuna.create_study(
    directions=['maximize', 'minimize'],
    sampler=optuna.samplers.TPESampler()
)

[I 2024-12-06 17:54:38,781] A new study created in memory with name: no-name-30111124-c96c-4262-bf99-3e1ac58ca861


In [38]:
!pip list | grep -e "catboost.*"

catboost           1.2.7


In [55]:
study.optimize(
    func=objective,
    n_trials=10,
    timeout=600,
    show_progress_bar=True,
    gc_after_trial=True
)

 10%|██▌                       | 1/10 [00:01<00:17,  1.96s/it, 1.96/600 seconds]

[I 2024-12-06 17:54:42,163] Trial 0 finished with values: [0.7187291298910364, 0.10770356021765004] and parameters: {'num_trees': 38, 'max_depth': 3, 'min_data_in_leaf': 76}.


 20%|█████▏                    | 2/10 [00:03<00:14,  1.78s/it, 3.62/600 seconds]

[I 2024-12-06 17:54:43,834] Trial 1 finished with values: [0.7099059696367679, 0.09752740516569369] and parameters: {'num_trees': 22, 'max_depth': 3, 'min_data_in_leaf': 63}.


 20%|█████▏                    | 2/10 [00:11<00:47,  5.99s/it, 3.62/600 seconds]

[W 2024-12-06 17:54:52,182] Trial 2 failed with parameters: {'num_trees': 40, 'max_depth': 10, 'min_data_in_leaf': 48} because of the following error: KeyboardInterrupt('').
Traceback (most recent call last):
  File "/opt/homebrew/Caskroom/miniconda/base/envs/course-work-venv/lib/python3.11/site-packages/optuna/study/_optimize.py", line 197, in _run_trial
    value_or_values = func(trial)
                      ^^^^^^^^^^^
  File "/var/folders/0k/pk7zyw6x2sqcjpbqvgf0ssh80000gn/T/ipykernel_25662/3582330355.py", line 10, in objective
    metrics = [np.mean(cross_val_score(model, X=pd.DataFrame(X), y=y, scoring='roc_auc', cv=10)), pr_diff(model)]
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/course-work-venv/lib/python3.11/site-packages/sklearn/utils/_param_validation.py", line 216, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/m




KeyboardInterrupt: 

In [56]:
study.trials_dataframe().sort_values(['values_0', 'values_1'], ascending=[False, True])

| | number | values_0 | values_1 | datetime_start | datetime_complete | duration | params_max_depth | params_min_data_in_leaf | params_num_trees | state |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0.718729 | 0.107704 | 2024-12-06 17:54:40.275254 | 2024-12-06 17:54:42.162937 | 0 days 00:00:01.887683 | 3 | 76 | 38 | COMPLETE |
| 1 | 1 | 0.709906 | 0.097527 | 2024-12-06 17:54:42.245397 | 2024-12-06 17:54:43.834692 | 0 days 00:00:01.589295 | 3 | 63 | 22 | COMPLETE |
| 2 | 2 | | | 2024-12-06 17:54:43.904506 | 2024-12-06 17:54:52.181906 | 0 days 00:00:08.277400 | 10 | 48 | 40 | FAIL |
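The study optimizes two objectives at once (`values_0` is maximized, `values_1` is minimized), so there is not necessarily a single best trial, and sorting by one column and then the other only breaks ties. The sketch below applies a plain Pareto-dominance check to the two completed trials from the table; the helper `dominates` is a hypothetical name, and Optuna itself exposes the same set through `study.best_trials`.

```python
# (trial number, mean CV ROC AUC to maximize, train/test gap to minimize),
# copied from the two COMPLETE rows of the table.
trials = [
    (0, 0.718729, 0.107704),
    (1, 0.709906, 0.097527),
]


def dominates(a, b):
    """True if a is at least as good as b on both objectives and strictly better on one."""
    _, auc_a, gap_a = a
    _, auc_b, gap_b = b
    return auc_a >= auc_b and gap_a <= gap_b and (auc_a > auc_b or gap_a < gap_b)


# A trial is on the Pareto front if no other trial dominates it.
pareto_front = [t for t in trials if not any(dominates(o, t) for o in trials)]
print([t[0] for t in pareto_front])  # [0, 1]
```

Trial 0 has the higher AUC and trial 1 the smaller gap, so neither dominates the other and both stay on the front.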
