### <h2 style="font-size: 40px;">Patient Condition from a Drug Review</h2>

In [46]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns
import re
import torch

#### Technical specification



#### Data description

- `uniqueID` - unique identifier of the user
- `drugName` - name of the drug
- `condition` - medical condition the drug was taken for
- `review` - text of the user's review
- `rating` - patient's rating on a 10-star scale
- `date` - date the review was published
- `usefulCount` - number of users who found the review useful

### Loading the data

In [47]:
# Download and unpack the datasets

#!wget "https://archive.ics.uci.edu/ml/machine-learning-databases/00462/drugsCom_raw.zip"
#!unzip drugsCom_raw.zip -d ./datasets/
#!rm drugsCom_raw.zip

In [48]:
# Load a table and display a quick overview of it
def parse_dataset(path):
    df = pd.read_csv(path, sep='\t', parse_dates=['date'])
    display(df.head())

    # Percentage of missing values per column, largest first
    display(round(df.isna().mean() * 100, 2).sort_values(ascending=False).to_frame(name='Missing values (%)'))

    df.info()
    return df


# Check that the file exists before parsing it
def check_dataset(path):
    if os.path.exists(path):
        return parse_dataset(path)
    else:
        print('Invalid file path')

In [49]:
drugs_data = check_dataset('./datasets/drugsComTrain_raw.tsv')

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,206461,Valsartan,Left Ventricular Dysfunction,"""It has no side effect, I take it in combinati...",9.0,2012-05-20,27
1,95260,Guanfacine,ADHD,"""My son is halfway through his fourth week of ...",8.0,2010-04-27,192
2,92703,Lybrel,Birth Control,"""I used to take another oral contraceptive, wh...",5.0,2009-12-14,17
3,138000,Ortho Evra,Birth Control,"""This is my first time using any form of birth...",8.0,2015-11-03,10
4,35696,Buprenorphine / naloxone,Opiate Dependence,"""Suboxone has completely turned my life around...",9.0,2016-11-27,37


Unnamed: 0,Missing values (%)
condition,0.56
Unnamed: 0,0.0
drugName,0.0
review,0.0
rating,0.0
date,0.0
usefulCount,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 161297 entries, 0 to 161296
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Unnamed: 0   161297 non-null  int64         
 1   drugName     161297 non-null  object        
 2   condition    160398 non-null  object        
 3   review       161297 non-null  object        
 4   rating       161297 non-null  float64       
 5   date         161297 non-null  datetime64[ns]
 6   usefulCount  161297 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 8.6+ MB


When the dataset was loaded, the `uniqueID` feature came in as `Unnamed: 0`, and the `date` column was parsed as datetime64. The `condition` column has 0.56% missing values.
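The `Unnamed: 0` name is the placeholder pandas assigns when a header cell is empty. A minimal sketch of the effect, using a tiny in-memory TSV as a stand-in for the real file (the sample rows are made up for illustration), together with the same missing-value percentage computed above:

```python
import io
import pandas as pd

# Toy TSV with an empty first header cell, mimicking the layout that
# produces an "Unnamed: 0" column. The rows are invented for illustration.
sample_tsv = (
    "\tdrugName\tcondition\trating\n"
    "101\tDrugA\tADHD\t8.0\n"
    "102\tDrugB\t\t5.0\n"
)

df = pd.read_csv(io.StringIO(sample_tsv), sep="\t")
print(df.columns.tolist())

# Give the placeholder column a meaningful name.
df = df.rename(columns={"Unnamed: 0": "id"})

# Share of missing values per column, in percent.
missing_pct = (df.isna().mean() * 100).round(2)
print(missing_pct)
```

On this toy input only `condition` has a gap, so it is the only column with a non-zero percentage.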

In [50]:
drugs_data_test = check_dataset('./datasets/drugsComTest_raw.tsv')

Unnamed: 0.1,Unnamed: 0,drugName,condition,review,rating,date,usefulCount
0,163740,Mirtazapine,Depression,"""I&#039;ve tried a few antidepressants over th...",10.0,2012-02-28,22
1,206473,Mesalamine,"Crohn's Disease, Maintenance","""My son has Crohn&#039;s disease and has done ...",8.0,2009-05-17,17
2,159672,Bactrim,Urinary Tract Infection,"""Quick reduction of symptoms""",9.0,2017-09-29,3
3,39293,Contrave,Weight Loss,"""Contrave combines drugs that were used for al...",9.0,2017-03-05,35
4,97768,Cyclafem 1 / 35,Birth Control,"""I have been on this birth control for one cyc...",9.0,2015-10-22,4


Unnamed: 0,Missing values (%)
condition,0.55
Unnamed: 0,0.0
drugName,0.0
review,0.0
rating,0.0
date,0.0
usefulCount,0.0


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53766 entries, 0 to 53765
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Unnamed: 0   53766 non-null  int64         
 1   drugName     53766 non-null  object        
 2   condition    53471 non-null  object        
 3   review       53766 non-null  object        
 4   rating       53766 non-null  float64       
 5   date         53766 non-null  datetime64[ns]
 6   usefulCount  53766 non-null  int64         
dtypes: datetime64[ns](1), float64(1), int64(2), object(3)
memory usage: 2.9+ MB


The drug reviews contain the HTML entity `&#039;`, which encodes an apostrophe ('). The `condition` column has 0.55% missing values.
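`&#039;` is an HTML numeric character reference. A minimal sketch of decoding such references with the standard-library `html.unescape`, which also handles other entities such as `&amp;`. The sample string is illustrative, and this is one possible alternative to a targeted string replacement rather than a required step:

```python
import html

# Illustrative review fragment containing two HTML entities.
sample = "I&#039;ve tried a few antidepressants &amp; none helped"

# html.unescape converts named and numeric character references to characters.
decoded = html.unescape(sample)
print(decoded)
```

A general decoder is useful when the text may hold more than one kind of entity. A plain `str.replace` is enough when only `&#039;` occurs.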

#### Data preprocessing

In [51]:
drugs_data.rename({'Unnamed: 0':'id'}, axis=1, inplace=True)
drugs_data_test.rename({'Unnamed: 0':'id'}, axis=1, inplace=True)

The column holding the unique user identifier was renamed to `id`.

In [62]:
drugs_data.duplicated(subset=["id"]).sum()

np.int64(0)

The training dataset contains no duplicated `id` values.
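Checking `id` alone only catches repeated identifiers. One possible follow-up, sketched here on made-up rows (the sample data and the choice of columns are assumptions, not something this notebook does), is to look for rows that repeat the same review content under different identifiers:

```python
import pandas as pd

# Invented rows: ids are unique, but two rows share the same drug and review.
toy = pd.DataFrame({
    "id": [1, 2, 3],
    "drugName": ["DrugA", "DrugA", "DrugB"],
    "review": ["works well", "works well", "no effect"],
})

# Duplicates by identifier only.
id_dupes = toy.duplicated(subset=["id"]).sum()

# Duplicates by content, ignoring the identifier.
content_dupes = toy.duplicated(subset=["drugName", "review"]).sum()

print(id_dupes, content_dupes)
```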

In [52]:
for i in range(10, 15):
    print(drugs_data['review'][i])
    print()

"I have been on this medication almost two weeks, started out on 25mg and working my way up to 100mg, currently at 50mg. No headaches at all so far and I was having 2-3 crippling migraines a week. I have lost 5.2lbs so far but note I am really paying close attention to what I am eating, I have a lot of weight to lose and if weight loss is a side effect I want to help it along as much as I can.  Now, other side effects, they are there the word recall issues exist, the memory issues, the worst of it seems to be the vision disturbances, there have been times I have just not driven because I&#039;m sure it would not have been safe. The good news is it seems to be wearing off...I have tons of energy and I am in a great mood."

"I have taken anti-depressants for years, with some improvement but mostly moderate to severe side affects, which makes me go off them.

I only take Cymbalta now mostly for pain.

When I began Deplin, I noticed a major improvement overnight. More energy, better dispos

The reviews contain the HTML entity `&#039;`, as well as repeated whitespace.

In [53]:
# Clean the text of a user review
def get_clean_text(row_review):
    row_review = row_review.replace('&#039;', "'")
    row_review = re.sub(r'[(),.!?]', '', row_review)
    row_review = re.sub(r'\s+', ' ', row_review)
    return row_review.lower()

In [54]:
drugs_data['review'] = drugs_data['review'].apply(get_clean_text)
drugs_data_test['review'] = drugs_data_test['review'].apply(get_clean_text)

The reviews in the training and test datasets were preprocessed: the `&#039;` entity was replaced with an apostrophe, parentheses and the punctuation marks `, . ! ?` were removed, runs of whitespace were collapsed to a single space, and the text was lowercased.
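One side effect of deleting punctuation outright is that words joined only by punctuation get merged, so `off...I` becomes `offi`. A hedged variant, under the hypothetical name `clean_text_keep_boundaries`, replaces punctuation with a space before collapsing whitespace. It is a sketch of one possible alternative, not the cleaning used in this notebook:

```python
import re

def get_clean_text(row_review):
    # Same steps as the notebook's function, repeated so the example runs alone.
    row_review = row_review.replace('&#039;', "'")
    row_review = re.sub(r'[(),.!?]', '', row_review)
    row_review = re.sub(r'\s+', ' ', row_review)
    return row_review.lower()

def clean_text_keep_boundaries(raw_review):
    # Hypothetical variant: punctuation becomes a space, so word boundaries survive.
    text = raw_review.replace('&#039;', "'")
    text = re.sub(r'[(),.!?]', ' ', text)
    text = re.sub(r'\s+', ' ', text)
    return text.strip().lower()

sample = "It seems to be wearing off...I&#039;m in a great mood."
print(get_clean_text(sample))              # words around "..." are merged
print(clean_text_keep_boundaries(sample))  # words stay separate
```

Whether the merged tokens matter depends on the tokenizer used downstream, so this is worth checking against a few raw reviews before changing the pipeline.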

### Exploratory analysis