# Homework No. 7. Analysis of Drug Reviews

## Data

Dataset link: [Kaggle: Drug Reviews](https://www.kaggle.com/datasets/jessicali9530/kuc-hackathon-winter-2018)

## Task

Build a model that predicts the `rating` column from the text in the `review` column. Treat this as a **regression** task.

A **deep NLP model** is required. You may choose the approach yourself:
- RNNs (for example, LSTM or GRU)
- transformers (for example, BERT or RoBERTa)
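
Because the target is a 1 to 10 rating treated as a continuous value, predictions can be scored with standard regression metrics. The sketch below is only an illustration: the assignment does not prescribe a metric, and the helper names `clip_to_scale`, `mae`, and `rmse` as well as the sample numbers are hypothetical.

```python
import math
from typing import Sequence


def clip_to_scale(preds: Sequence[float], low: float = 1.0, high: float = 10.0) -> list[float]:
    """Clamp raw model outputs to the valid rating range [low, high]."""
    return [min(max(p, low), high) for p in preds]


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Made-up targets and raw regression outputs, for illustration only.
y_true = [9, 8, 5, 10]
raw_preds = [8.5, 11.2, 4.0, 9.0]

preds = clip_to_scale(raw_preds)  # 11.2 is pulled back to 10.0
print(preds, mae(y_true, preds), rmse(y_true, preds))
```

Clipping is optional. It only guards against a regression head producing values outside the rating scale.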

---

## Additional (optional) tasks

You may complete one or more of the additional tasks. They are not required, but they can significantly raise the final grade.

### 1. Topic Modeling

Try topic clustering of the texts in `review`:

- First obtain **text embeddings** (suitable examples are listed at the end of the assignment page).
- Use **English-language models**, since the texts are in English.
- Cluster the embeddings. Possible methods:
  - c-TF-IDF
  - LDA (an implementation that uses transformer embeddings is acceptable)

**How to assess cluster quality:**
Look at the keywords (or phrases) within each cluster. If they describe a common theme of the reviews logically and consistently, the clustering can be considered successful.
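
To inspect keywords per cluster, one option is a class-based TF-IDF score: all documents of a cluster are merged into one bag of words, and each word is weighted by its in-cluster frequency times `log(1 + A / f_t)`, where `A` is the average number of words per cluster and `f_t` is the word's total frequency. The sketch below is a simplified pure-Python version of that idea, not the implementation of any particular library. The function name `top_keywords`, the tokenizer, and the toy documents and labels are assumptions.

```python
import math
import re
from collections import Counter, defaultdict


def top_keywords(docs, labels, top_n=2):
    """Rank words per cluster with a simplified class-based TF-IDF."""
    # Merge all documents of a cluster into one bag of words.
    class_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        class_counts[label].update(re.findall(r"[a-z']+", doc.lower()))

    # f_t: frequency of each word across all clusters.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)

    # A: average number of words per cluster.
    avg_words = sum(total_counts.values()) / len(class_counts)

    keywords = {}
    for label, counts in class_counts.items():
        n_words = sum(counts.values())
        scores = {
            word: (tf / n_words) * math.log(1 + avg_words / total_counts[word])
            for word, tf in counts.items()
        }
        # Highest score first; ties are broken alphabetically for stable output.
        keywords[label] = sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
    return keywords


# Toy reviews with cluster labels that would normally come from clustering embeddings.
docs = [
    "headache and nausea after my first dose",
    "nausea with headache for a week",
    "my periods became lighter on the pill",
    "the pill made my periods irregular",
]
labels = [0, 0, 1, 1]
print(top_keywords(docs, labels))
```

On real data a stop-word list would likely be needed, since frequent function words can tie with topical words in small clusters.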

### 2. Distilling a transformer into an LSTM

You can try to **distill a large transformer model** into a lighter LSTM architecture:

- either use the prepared code (link at the end of the assignment)
- or find and adapt an existing solution

This can be useful for speeding up inference or lowering resource requirements.
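
The prepared code mentioned above is not shown here, so the following is not that code. It is a minimal sketch of one common way to set up a distillation objective for a regression target: the student is pulled toward both the teacher's predictions and the true ratings, with a mixing weight `alpha`. The function names and the sample numbers are hypothetical.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    target: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Blend of 'imitate the teacher' and 'fit the labels'.

    alpha = 1.0 trains only on teacher outputs,
    alpha = 0.0 falls back to ordinary supervised MSE.
    """
    return alpha * mse(student, teacher) + (1 - alpha) * mse(student, target)


# Made-up predictions for two reviews.
student = [7.0, 3.0]   # LSTM outputs
teacher = [8.0, 2.0]   # transformer outputs
target = [9.0, 1.0]    # true ratings

print(distillation_loss(student, teacher, target, alpha=0.5))
```

In a real training loop the same formula would be written with tensor operations so that gradients flow into the student only.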


## Loading and initial check of the dataset

In this step we download the dataset from Kaggle and load the train and test CSV files.
The data is read in full, without filtering rows or columns.
We check the split sizes and basic structure, and visually inspect the first records.
No cleaning or preprocessing is performed here.
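
The helper `load_raw_splits` lives in `scripts.dataset` and its source is not shown in this notebook. Judging only by its log messages (row and column counts, numeric conversion of ratings), loading one split might look roughly like the sketch below. The function `load_split_sketch` and the in-memory sample CSV are hypothetical stand-ins, not the project's code.

```python
import io

import pandas as pd


def load_split_sketch(source) -> pd.DataFrame:
    """Hypothetical stand-in for loading one split; not the project's real helper."""
    # Read the whole file: no row or column filtering.
    df = pd.read_csv(source)
    # Coerce ratings to numbers; unparseable values become NaN instead of raising.
    df["rating"] = pd.to_numeric(df["rating"], errors="coerce")
    print(f"Loaded dataframe with {df.shape[0]} rows and {df.shape[1]} columns")
    print(f"Missing ratings after conversion: {int(df['rating'].isna().sum())}")
    return df


# Tiny in-memory CSV with one deliberately broken rating, for illustration only.
sample_csv = io.StringIO(
    "uniqueID,review,rating\n"
    "1,Works well,9\n"
    "2,No effect,unknown\n"
    "3,Good,10\n"
)
df = load_split_sketch(sample_csv)
```

With `errors="coerce"` a malformed rating shows up as a counted missing value rather than stopping the load, which matches the kind of message seen in the log.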


In [1]:
from pathlib import Path
from IPython.display import Markdown, display

from scripts.dataset import download_dataset, load_raw_splits


def md(text: str) -> None:
    display(Markdown(text))


project_root = Path.cwd().resolve()
raw_data_dir = project_root / "data" / "raw"

download_dataset(
    dataset_id="jessicali9530/kuc-hackathon-winter-2018",
    target_dir=raw_data_dir,
    force=False,
)

train_csv = raw_data_dir / "drugsComTrain_raw.csv"
test_csv = raw_data_dir / "drugsComTest_raw.csv"

train_df, test_df = load_raw_splits(
    train_path=train_csv,
    test_path=test_csv,
)

md("### Train dataset overview")
md(f"Shape: {train_df.shape}")
display(train_df.head())

md("### Test dataset overview")
md(f"Shape: {test_df.shape}")
display(test_df.head())


Dataset already exists. Skipping download.

Loading training dataset

Loading dataset from `/home/garret/git/mfti/llm_hw7_medicine_reviews/data/raw/drugsComTrain_raw.csv`

Loaded dataframe with 161297 rows and 7 columns

Ratings converted to numeric. Missing values after conversion: 0

Loading test dataset

Loading dataset from `/home/garret/git/mfti/llm_hw7_medicine_reviews/data/raw/drugsComTest_raw.csv`

Loaded dataframe with 53766 rows and 7 columns

Ratings converted to numeric. Missing values after conversion: 0

Finished loading train and test datasets

### Train dataset overview

Shape: (161297, 7)

|  | uniqueID | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combinati... | 9 | 20-May-12 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of ... | 8 | 27-Apr-10 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, wh... | 5 | 14-Dec-09 | 17 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth... | 8 | 3-Nov-15 | 10 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around... | 9 | 27-Nov-16 | 37 |


### Test dataset overview

Shape: (53766, 7)

|  | uniqueID | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 163740 | Mirtazapine | Depression | "I&#039;ve tried a few antidepressants over th... | 10 | 28-Feb-12 | 22 |
| 1 | 206473 | Mesalamine | Crohn's Disease, Maintenance | "My son has Crohn&#039;s disease and has done ... | 8 | 17-May-09 | 17 |
| 2 | 159672 | Bactrim | Urinary Tract Infection | "Quick reduction of symptoms" | 9 | 29-Sep-17 | 3 |
| 3 | 39293 | Contrave | Weight Loss | "Contrave combines drugs that were used for al... | 9 | 5-Mar-17 | 35 |
| 4 | 97768 | Cyclafem 1 / 35 | Birth Control | "I have been on this birth control for one cyc... | 9 | 22-Oct-15 | 4 |


## Brief observations on the data

The dataset is already split into train and test, which simplifies correct model evaluation and reduces the risk of data leakage.

The train split contains about 161 thousand records and the test split about 54 thousand. Both files share the same structure with 7 columns each.

The key columns for the main task are `review` and `rating`. The reviews are in English, vary in length, and contain colloquial language and traces of user input. Ratings are integers in the range from 1 to 10, and no missing values were found after type conversion.

The additional columns `drugName`, `condition`, `date`, and `usefulCount` may carry useful information, but they are not used at this stage and are kept unchanged for further analysis.

No data was cleaned or modified in this step. All further decisions about filtering, cleaning, or feature expansion will be made separately and deliberately.
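
One concrete trace of user input is visible in the test preview above: the HTML entity `&#039;` in place of an apostrophe. It is not touched at this step. If a later cleaning step decides to decode it, the standard library already covers the case, as in this small sketch (the wrapper name `decode_entities` is hypothetical).

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character references such as &#039; back into plain characters."""
    return html.unescape(text)


sample = "I&#039;ve tried a few antidepressants"
print(decode_entities(sample))
```

Text without entities passes through unchanged, so the function is safe to apply to the whole column.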


## Dataset diagnostics

In this step we do not clean the data or remove anything.
We record facts about data quality and structure: missing values, duplicates, date validity, the rating distribution, and basic statistics of text length.
All analysis logic lives in `scripts`; the notebook only orchestrates the calls and the display.


In [2]:
from scripts.inspection import run_inspection

run_inspection(train_df=train_df, test_df=test_df)


## Train inspection

Shape: (161297, 7)

### Head

|  | uniqueID | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 206461 | Valsartan | Left Ventricular Dysfunction | "It has no side effect, I take it in combination of Bystolic 5 Mg and Fish Oil" | 9 | 20-May-12 | 27 |
| 1 | 95260 | Guanfacine | ADHD | "My son is halfway through his fourth week of Intuniv. We became concerned when he began this last week, when he started taking the highest dose he will be on. For two days, he could hardly get out of bed, was very cranky, and slept for nearly 8 hours on a drive home from school vacation (very unusual for him.) I called his doctor on Monday morning and she said to stick it out a few days. See how he did at school, and with getting up in the morning. The last two days have been problem free. He is MUCH more agreeable than ever. He is less emotional (a good thing), less cranky. He is remembering all the things he should. Overall his behavior is better. We have tried many different medications and so far this is the most effective." | 8 | 27-Apr-10 | 192 |
| 2 | 92703 | Lybrel | Birth Control | "I used to take another oral contraceptive, which had 21 pill cycle, and was very happy- very light periods, max 5 days, no other side effects. But it contained hormone gestodene, which is not available in US, so I switched to Lybrel, because the ingredients are similar. When my other pills ended, I started Lybrel immediately, on my first day of period, as the instructions said. And the period lasted for two weeks. When taking the second pack- same two weeks. And now, with third pack things got even worse- my third period lasted for two weeks and now it's the end of the third week- I still have daily brown discharge. The positive side is that I didn't have any other side effects. The idea of being period free was so tempting... Alas." | 5 | 14-Dec-09 | 17 |
| 3 | 138000 | Ortho Evra | Birth Control | "This is my first time using any form of birth control. I'm glad I went with the patch, I have been on it for 8 months. At first It decreased my libido but that subsided. The only downside is that it made my periods longer (5-6 days to be exact) I used to only have periods for 3-4 days max also made my cramps intense for the first two days of my period, I never had cramps before using birth control. Other than that in happy with the patch" | 8 | 3-Nov-15 | 10 |
| 4 | 35696 | Buprenorphine / naloxone | Opiate Dependence | "Suboxone has completely turned my life around. I feel healthier, I'm excelling at my job and I always have money in my pocket and my savings account. I had none of those before Suboxone and spent years abusing oxycontin. My paycheck was already spent by the time I got it and I started resorting to scheming and stealing to fund my addiction. All that is history. If you're ready to stop, there's a good chance that suboxone will put you on the path of great life again. I have found the side-effects to be minimal compared to oxycontin. I'm actually sleeping better. Slight constipation is about it for me. It truly is amazing. The cost pales in comparison to what I spent on oxycontin." | 9 | 27-Nov-16 | 37 |


### Missing values and dtypes

| column | dtype | missing_count | missing_share | nunique |
|---|---|---|---|---|
| condition | object | 899 | 0.005574 | 884 |
| uniqueID | int64 | 0 | 0.0 | 161297 |
| drugName | object | 0 | 0.0 | 3436 |
| review | object | 0 | 0.0 | 112329 |
| rating | int64 | 0 | 0.0 | 10 |
| date | object | 0 | 0.0 | 3579 |
| usefulCount | int64 | 0 | 0.0 | 389 |
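
A per-column summary with these fields can be assembled from a few pandas calls. The sketch below is an assumption about how such a table might be produced. The real logic sits inside `run_inspection` and is not shown. The function name `missing_summary` and the toy frame are hypothetical.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """One row per column: dtype, missing count and share, number of distinct values."""
    summary = pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "missing_count": df.isna().sum(),
            "missing_share": df.isna().mean(),
            "nunique": df.nunique(),  # NaN is not counted as a distinct value
        }
    )
    # Columns with the most gaps first; a stable sort keeps the original order for ties.
    return summary.sort_values("missing_count", ascending=False, kind="stable")


# Toy frame with one missing condition, for illustration only.
toy = pd.DataFrame(
    {
        "drugName": ["Valsartan", "Guanfacine", "Lybrel"],
        "condition": [None, "ADHD", "ADHD"],
        "rating": [9, 8, 5],
    }
)
summary = missing_summary(toy)
print(summary)
```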


### Duplicates

|  | key | subset | duplicate_rows | duplicate_share |
|---|---|---|---|---|
| 0 | full_row_duplicates | all_columns | 0 | 0.0 |
| 1 | review_duplicates | review | 48968 | 0.303589 |
| 2 | review_rating_duplicates | review, rating | 48879 | 0.303037 |
| 3 | drug_condition_review_duplicates | drugName, condition, review | 50 | 0.00031 |


Conflicting ratings for same review: 72

|  | review | distinct_ratings |
|---|---|---|
| 13552 | "Good" | 6 |
| 13558 | "Good." | 5 |
| 14147 | "Great" | 4 |
| 9681 | "Did not work well for me." | 3 |
| 79307 | "It works." | 3 |
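
The duplicate counts in this report are consistent with counting every repeat after the first occurrence: 161297 rows minus 112329 unique review texts gives 48968, the reported `review_duplicates` value. The implementation inside `run_inspection` is not shown, so the sketch below is only one way to obtain such numbers with pandas, on a made-up frame.

```python
import pandas as pd

# Made-up reviews: "Good" appears three times with two different ratings.
df = pd.DataFrame(
    {
        "review": ["Good", "Good", "Great", "It works.", "It works.", "Good"],
        "rating": [10, 8, 9, 7, 7, 10],
    }
)

# Rows that repeat an earlier review text (the first occurrence is not counted).
review_dups = int(df.duplicated(subset=["review"]).sum())

# Rows that repeat an earlier (review, rating) pair.
pair_dups = int(df.duplicated(subset=["review", "rating"]).sum())

# Review texts that occur with more than one distinct rating.
distinct = df.groupby("review")["rating"].nunique()
conflicts = distinct[distinct > 1].sort_values(ascending=False)

print(review_dups, pair_dups)
print(conflicts)
```

Conflicting ratings for identical short texts such as "Good" put a floor on the error any text-only model can reach on those rows.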


Parsing date column using non strict pandas parser

### Dates

Invalid dates: 0 (0.000000)

Date range: 2008-02-24 00:00:00 to 2017-12-12 00:00:00
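
The log says the dates were parsed with a non-strict pandas parser. For comparison, a strict parse with an explicit format is sketched below. The format string `%d-%b-%y` is an assumption read off the previews (for example `20-May-12`), and the helper name `parse_review_date` is hypothetical.

```python
from datetime import datetime
from typing import Optional

# Day, abbreviated English month name, two-digit year, e.g. "20-May-12".
# Note: %b depends on the process locale; an English or C locale is assumed.
DATE_FORMAT = "%d-%b-%y"


def parse_review_date(raw: str) -> Optional[datetime]:
    """Return a datetime for well-formed values and None for anything else."""
    try:
        return datetime.strptime(raw.strip(), DATE_FORMAT)
    except ValueError:
        return None


for value in ["20-May-12", "3-Nov-15", "2012-05-20"]:
    print(value, "->", parse_review_date(value))
```

A strict format makes invalid or unexpectedly formatted values explicit instead of letting a flexible parser guess.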

### Rating

min 1.0, max 10.0, mean 6.994376832799122, median 8.0, missing 0

| rating | count |
|---|---|
| 1 | 21619 |
| 2 | 6931 |
| 3 | 6513 |
| 4 | 5012 |
| 5 | 8013 |
| 6 | 6343 |
| 7 | 9456 |
| 8 | 18890 |
| 9 | 27531 |
| 10 | 50989 |
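
The count table is enough to re-derive the mean and median from the summary line above, and to get reference errors for a constant predictor, which any trained regressor should beat. The sketch below uses only the train counts from this report. The idea of a constant baseline and the helper name `median_from_counts` are additions for illustration.

```python
import math

# Train rating counts copied from the table above.
train_counts = {
    1: 21619, 2: 6931, 3: 6513, 4: 5012, 5: 8013,
    6: 6343, 7: 9456, 8: 18890, 9: 27531, 10: 50989,
}

n = sum(train_counts.values())
mean = sum(r * c for r, c in train_counts.items()) / n


def median_from_counts(counts: dict[int, int]) -> float:
    """Median of the sample described by a value -> count mapping."""
    total = sum(counts.values())
    # 1-based positions of the two middle elements (identical when total is odd).
    lo_pos, hi_pos = (total + 1) // 2, (total + 2) // 2

    def value_at(pos: int) -> int:
        running = 0
        for value in sorted(counts):
            running += counts[value]
            if running >= pos:
                return value
        raise ValueError("position out of range")

    return (value_at(lo_pos) + value_at(hi_pos)) / 2


median = median_from_counts(train_counts)

# Errors of always predicting a single constant on the train split.
rmse_mean = math.sqrt(sum(c * (r - mean) ** 2 for r, c in train_counts.items()) / n)
mae_median = sum(c * abs(r - median) for r, c in train_counts.items()) / n

print(n, mean, median)
print(rmse_mean, mae_median)
```

The distribution is strongly skewed toward 10 and 1, so a constant prediction is a weak but useful reference point.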


### Review text length

Empty like reviews: 0 (0.000000)

| quantile | chars | words |
|---|---|---|
| 0.0 | 3.0 | 1.0 |
| 0.25 | 262.0 | 48.0 |
| 0.5 | 455.0 | 84.0 |
| 0.75 | 691.0 | 126.0 |
| 0.9 | 758.0 | 141.0 |
| 0.95 | 770.0 | 146.0 |
| 0.99 | 795.0 | 154.0 |
| 1.0 | 10787.0 | 1894.0 |


Over 512 words: 31 (0.000192), over 1024 words: 5 (0.000031)
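
The shares in parentheses are the counts divided by the number of rows, for example 31 / 161297 ≈ 0.000192. A report of this kind can be produced with a few lines of plain Python, as sketched below. Splitting on whitespace is an assumption about how words were counted, and the helper name `length_report` and the toy texts are hypothetical.

```python
def length_report(texts, limits=(512, 1024)):
    """For each limit, return (count, share) of texts with more words than the limit."""
    # Words are approximated by whitespace-separated chunks.
    word_counts = [len(text.split()) for text in texts]
    total = len(word_counts)
    report = {}
    for limit in limits:
        over = sum(1 for count in word_counts if count > limit)
        report[limit] = (over, over / total)
    return report


# Toy texts: one short review and two artificially long ones.
texts = [
    "Quick reduction of symptoms",
    "word " * 600,
    "word " * 1100,
]
report = length_report(texts)
for limit, (over, share) in report.items():
    print(f"Over {limit} words: {over} ({share:.6f})")
```

Word counts are only a proxy for model input length: subword tokenizers used by transformer models usually produce more tokens than there are words, so a review under 512 words can still exceed a 512-token limit.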

## Test inspection

Shape: (53766, 7)

### Head

|  | uniqueID | drugName | condition | review | rating | date | usefulCount |
|---|---|---|---|---|---|---|---|
| 0 | 163740 | Mirtazapine | Depression | "I've tried a few antidepressants over the years (citalopram, fluoxetine, amitriptyline), but none of those helped with my depression, insomnia & anxiety. My doctor suggested and changed me onto 45mg mirtazapine and this medicine has saved my life. Thankfully I have had no side effects especially the most common - weight gain, I've actually lost alot of weight. I still have suicidal thoughts but mirtazapine has saved me." | 10 | 28-Feb-12 | 22 |
| 1 | 206473 | Mesalamine | Crohn's Disease, Maintenance | "My son has Crohn's disease and has done very well on the Asacol. He has no complaints and shows no side effects. He has taken as many as nine tablets per day at one time. I've been very happy with the results, reducing his bouts of diarrhea drastically." | 8 | 17-May-09 | 17 |
| 2 | 159672 | Bactrim | Urinary Tract Infection | "Quick reduction of symptoms" | 9 | 29-Sep-17 | 3 |
| 3 | 39293 | Contrave | Weight Loss | "Contrave combines drugs that were used for alcohol, smoking, and opioid cessation. People lose weight on it because it also helps control over-eating. I have no doubt that most obesity is caused from sugar/carb addiction, which is just as powerful as any drug. I have been taking it for five days, and the good news is, it seems to go to work immediately. I feel hungry before I want food now. I really don't care to eat; it's just to fill my stomach. Since I have only been on it a few days, I don't know if I've lost weight (I don't have a scale), but my clothes do feel a little looser, so maybe a pound or two. I'm hoping that after a few months on this medication, I will develop healthier habits that I can continue without the aid of Contrave." | 9 | 5-Mar-17 | 35 |
| 4 | 97768 | Cyclafem 1 / 35 | Birth Control | "I have been on this birth control for one cycle. After reading some of the reviews on this type and similar birth controls I was a bit apprehensive to start. Im giving this birth control a 9 out of 10 as I have not been on it long enough for a 10. So far I love this birth control! My side effects have been so minimal its like Im not even on birth control! I have experienced mild headaches here and there and some nausea but other than that ive been feeling great! I got my period on cue on the third day of the inactive pills and I had no idea it was coming because I had zero pms! My period was very light and I barely had any cramping! I had unprotected sex the first month and obviously didn't get pregnant so I'm very pleased! Highly recommend" | 9 | 22-Oct-15 | 4 |


### Missing values and dtypes

| column | dtype | missing_count | missing_share | nunique |
|---|---|---|---|---|
| condition | object | 295 | 0.005487 | 708 |
| uniqueID | int64 | 0 | 0.0 | 53766 |
| drugName | object | 0 | 0.0 | 2637 |
| review | object | 0 | 0.0 | 48280 |
| rating | int64 | 0 | 0.0 | 10 |
| date | object | 0 | 0.0 | 3566 |
| usefulCount | int64 | 0 | 0.0 | 325 |


### Duplicates

|  | key | subset | duplicate_rows | duplicate_share |
|---|---|---|---|---|
| 0 | full_row_duplicates | all_columns | 0 | 0.0 |
| 1 | review_duplicates | review | 5486 | 0.102035 |
| 2 | review_rating_duplicates | review, rating | 5464 | 0.101626 |
| 3 | drug_condition_review_duplicates | drugName, condition, review | 8 | 0.000149 |


Conflicting ratings for same review: 18

|  | review | distinct_ratings |
|---|---|---|
| 5786 | "Good." | 3 |
| 5784 | "Good" | 3 |
| 47193 | "Works great" | 3 |
| 45629 | "Very good" | 3 |
| 5758 | "Good medicine" | 2 |


Parsing date column using non strict pandas parser

### Dates

Invalid dates: 0 (0.000000)

Date range: 2008-02-25 00:00:00 to 2017-12-12 00:00:00

### Rating

min 1.0, max 10.0, mean 6.97689989956478, median 8.0, missing 0

| rating | count |
|---|---|
| 1 | 7299 |
| 2 | 2334 |
| 3 | 2205 |
| 4 | 1659 |
| 5 | 2710 |
| 6 | 2119 |
| 7 | 3091 |
| 8 | 6156 |
| 9 | 9177 |
| 10 | 17016 |


### Review text length

Empty like reviews: 0 (0.000000)

| quantile | chars | words |
|---|---|---|
| 0.0 | 3.0 | 1.0 |
| 0.25 | 262.0 | 48.0 |
| 0.5 | 457.0 | 84.0 |
| 0.75 | 689.0 | 126.0 |
| 0.9 | 758.0 | 141.0 |
| 0.95 | 770.0 | 146.0 |
| 0.99 | 794.0 | 154.0 |
| 1.0 | 6192.0 | 1162.0 |


Over 512 words: 2 (0.000037), over 1024 words: 1 (0.000019)