1. Collecting reviews of Kanye West's albums from metacritic.com with a crawler (thanks to my classmates for pointing me to this site)

In [1]:
import requests
from pprint import pprint
from bs4 import BeautifulSoup
from fake_useragent import UserAgent
ua = UserAgent(verify_ssl=False)
session = requests.session()

In [2]:
def parse_page(url):
    req = session.get(url, headers={'User-Agent': ua.random})
    page = req.text
    soup = BeautifulSoup(page, 'html.parser')
    return soup

In [3]:
def get_albums(url):  # return the user-review page URLs for every album of the artist
    albums = []
    page = parse_page(url)
    r = page.find_all('td', {'class' : 'title brief_metascore'})
    for album in r:
        l = album.find('a').attrs['href']
        albums.append(f'https://www.metacritic.com{l}/user-reviews')
    return albums

In [4]:
k_albums = get_albums('https://www.metacritic.com/person/kanye-west')

In [5]:
def parse_reviews(url):  # parse the reviews on one album page; takes the URL of that page
    all_reviews = []
    page = parse_page(url)
    reviews = page.find_all('div', {'class' : 'review_content'})
    for review in reviews:
        dic = {}
        dic['text'] = review.find_all('div', {'class' : 'review_body'})[0].text.strip()
        dic['grade'] = review.find_all('div', {'class' : 'review_grade'})[0].text.strip()
        all_reviews.append(dic)
    return all_reviews

In [6]:
reviews = []
for album in k_albums:
    reviews.extend(parse_reviews(album))

In [7]:
import pandas as pd
data = pd.DataFrame(reviews)  # turn the collected reviews into a DataFrame

In [8]:
def grade_to_sentiment(x):  # label reviews as positive (1) or negative (0)
    if x > 10:  # some grades on the site use a 100-point scale
        sent = x / 10
    else:
        sent = x
    if sent <= 5:
        return 0
    else:
        return 1
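
As a quick sanity check of the rule above, the sketch below restates it in compact form and runs it on a few grades, including the 3.0 and 70.0 values that show up in the data. The name `to_sentiment` is hypothetical and only mirrors `grade_to_sentiment`.

```python
def to_sentiment(grade):
    """Mirror of grade_to_sentiment: rescale 100-point grades, then threshold at 5."""
    scaled = grade / 10 if grade > 10 else grade
    return int(scaled > 5)

for grade in (0.0, 3.0, 5.0, 7.0, 10.0, 50.0, 70.0):
    print(grade, '->', to_sentiment(grade))
```

Note that a 100-point grade of 10 or lower would be read as a 10-point grade, so the rule implicitly assumes such grades do not occur.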


In [9]:
data['grade'] = data['grade'].astype(float)  
data['sentiment'] = data['grade'].apply(grade_to_sentiment)
data

Unnamed: 0,text,grade,sentiment
0,Immensely mediocre. I hope Ye doesn't make any...,3.0,0
1,It's clearly unfinished. The best is yet to co...,7.0,1
2,"This review contains spoilers, click expand to...",7.0,1
3,"Donda, Donda, DondaDonda, Donda, Donda, Donda,...",10.0,1
4,The album is such a dissappointment I can't ev...,0.0,0
...,...,...,...
1181,Now I usually like the rock music. Actually no...,9.0,1
1182,This album was the BOMB!,10.0,1
1183,Like every hip-hop album (even the great ones)...,70.0,1
1184,Most producers who approach the mic do so at t...,70.0,1


In [10]:
data['sentiment'].value_counts()

1    1017
0     169
Name: sentiment, dtype: int64
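
The classes are clearly imbalanced, so it is worth knowing what accuracy a classifier would get by always predicting the majority class. A small worked calculation from the counts above:

```python
# Class counts copied from the value_counts() output above
counts = {1: 1017, 0: 169}

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(total)                     # 1186
print(round(majority_share, 4))  # 0.8575
```

Any accuracy figure for this dataset is easier to interpret against this reference of roughly 0.86.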

2. Tokenizing, lemmatizing, lowercasing, and removing stop words

In [11]:
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import wordpunct_tokenize
import nltk
stops = set(stopwords.words('english'))
wnl = WordNetLemmatizer()

In [12]:
def lemmatize(x, returnList=True):
    if not isinstance(x, str):
        return ""
    result = []
    for word in wordpunct_tokenize(x):
        if word.isalpha():
            # Lemmatization runs before lowercasing, so capitalized forms reach WordNetLemmatizer as written
            nf = wnl.lemmatize(word).lower()
            if nf not in stops:
                result.append(nf)
    if returnList:
        return result
    return " ".join(result)

In [13]:
data['lemmas'] = data['text'].apply(lemmatize)

In [14]:
positive = data[data['sentiment'] == 1]['lemmas'].to_list()
negative = data[data['sentiment'] == 0]['lemmas'].to_list()

3. Building the word sets

In [15]:
from collections import Counter

In [27]:
def collect_freqlist(reviews, max_len=300):  # build a frequency list and return its top words as a set
    freqlist = Counter()
    for text in reviews:
        for word in text:
            if word.isalpha():
                freqlist[word] += 1
    return {word for word, _ in freqlist.most_common(max_len)}

In [28]:
pos_set = collect_freqlist(positive)
neg_set = collect_freqlist(negative)
pos_only = pos_set.difference(neg_set)  # words that occur only in the positive or only in the negative top list
neg_only = neg_set.difference(pos_set)
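
To see how the frequency cutoff shapes these sets, here is a minimal sketch on made-up data. The toy reviews and the helper name `top_words` are hypothetical; the logic is meant to follow `collect_freqlist` with a much smaller `max_len`.

```python
from collections import Counter

# Hypothetical tokenized reviews standing in for `positive` and `negative`
toy_positive = [['great', 'album', 'beat'],
                ['great', 'album', 'beat', 'production']]
toy_negative = [['boring', 'album', 'lazy'],
                ['boring', 'album', 'lazy', 'beat']]

def top_words(token_lists, max_len):
    """Return the set of the max_len most frequent tokens."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, _ in counts.most_common(max_len)}

toy_pos = top_words(toy_positive, max_len=3)
toy_neg = top_words(toy_negative, max_len=3)

print(sorted(toy_pos - toy_neg))  # ['beat', 'great']
print(sorted(toy_neg - toy_pos))  # ['boring', 'lazy']
```

Here 'beat' occurs in both toy corpora, but it misses the top 3 of the negative one and therefore ends up as a "positive-only" word. The same effect applies to the real sets, which is why the choice of `max_len` matters.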

4. Writing a function to predict a review's sentiment

In [29]:
def predict_sentiment(review):  # a very simple rule-based sentiment predictor
    prep_review = lemmatize(review)
    pos_points = 0
    neg_points = 0
    for word in prep_review:
        if word in pos_only:
            pos_points += 1
        if word in neg_only:
            neg_points += 1
    if pos_points > neg_points:
        return 1
    else:
        return 0
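
The decision rule is easiest to follow on a few hand-made inputs. The sketch below applies the same comparison to already tokenized text; the word sets and the helper name `score_review` are hypothetical.

```python
def score_review(tokens, pos_words, neg_words):
    """Count hits in each word set and apply the same rule as predict_sentiment."""
    pos_points = sum(word in pos_words for word in tokens)
    neg_points = sum(word in neg_words for word in tokens)
    label = 1 if pos_points > neg_points else 0
    return pos_points, neg_points, label

# Hypothetical word sets, for illustration only
toy_pos_only = {'great', 'masterpiece'}
toy_neg_only = {'boring', 'lazy'}

print(score_review(['great', 'great', 'boring'], toy_pos_only, toy_neg_only))  # (2, 1, 1)
print(score_review(['great', 'boring'], toy_pos_only, toy_neg_only))           # (1, 1, 0)
print(score_review(['album', 'song'], toy_pos_only, toy_neg_only))             # (0, 0, 0)
```

Ties, and reviews with no matching words at all, fall into class 0, even though positive reviews are the majority in this dataset.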


In [30]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    data[['text', 'lemmas']], data[['sentiment']], test_size=0.2, random_state=42)

print(len(X_train), 'training reviews')
print(len(X_test), 'testing reviews')

948 training reviews
238 testing reviews


In [31]:
from sklearn.metrics import accuracy_score

y_train_pred = X_train['text'].apply(predict_sentiment)
print('train accuracy:', accuracy_score(y_train, y_train_pred))

train accuracy: 0.7542194092827004
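
The word sets in section 3 were built from all reviews before the split, so the score above is measured on text that also shaped the sets. A minimal sketch of the alternative, building the sets from the training portion only and scoring on held-out reviews, could look like this. The toy data and the helper names `top_words`, `fit_word_sets` and `predict` are hypothetical.

```python
from collections import Counter

def top_words(token_lists, max_len=300):
    """Return the set of the max_len most frequent tokens."""
    counts = Counter(word for tokens in token_lists for word in tokens)
    return {word for word, _ in counts.most_common(max_len)}

def fit_word_sets(token_lists, labels, max_len=300):
    """Build positive-only and negative-only word sets from labelled token lists."""
    pos = top_words([t for t, y in zip(token_lists, labels) if y == 1], max_len)
    neg = top_words([t for t, y in zip(token_lists, labels) if y == 0], max_len)
    return pos - neg, neg - pos

def predict(tokens, pos_only, neg_only):
    """Same rule as predict_sentiment, applied to already tokenized text."""
    pos_points = sum(word in pos_only for word in tokens)
    neg_points = sum(word in neg_only for word in tokens)
    return 1 if pos_points > neg_points else 0

# Hypothetical toy data as (tokens, label) pairs
train = [(['great', 'album'], 1), (['great', 'beat'], 1), (['boring', 'album'], 0)]
test = [(['great', 'song'], 1), (['boring', 'beat'], 0)]

# Fit on the training part only, then score on reviews the sets have never seen
pos_only, neg_only = fit_word_sets([t for t, _ in train], [y for _, y in train])
correct = sum(predict(t, pos_only, neg_only) == y for t, y in test)
print('held-out accuracy:', correct / len(test))  # held-out accuracy: 1.0
```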


5. Ways to improve: tune the frequency cutoff for the "positive" and "negative" words when building the sets; train an ML model instead of the primitive sentiment function; use a ready-made lexicon of positive/negative words. Enlarging the corpus would not hurt either.

We will use logistic regression with tf-idf features:

In [32]:
data['for_ml'] = data['text'].apply(lemmatize, returnList = False)

In [33]:
X_train, X_test, y_train, y_test = train_test_split(
    data['for_ml'], data[['sentiment']], test_size=0.2, random_state=42)

In [34]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
vect = TfidfVectorizer(ngram_range=(1,2),min_df=7)
vect.fit(X_train)
vect_X = vect.transform(X_train)
# y_train is a one-column DataFrame; flatten it to 1-D so that sklearn does not emit a DataConversionWarning
clf = LogisticRegression().fit(vect_X, y_train.values.ravel())
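
To make the vectorizer settings more concrete: `ngram_range=(1,2)` admits single words and two-word sequences, and an integer `min_df` drops terms that occur in fewer than that many documents. The sketch below imitates only this feature selection step in plain Python. It is not scikit-learn's implementation: it splits on whitespace, skips the tf-idf weighting, and uses hypothetical toy reviews and helper names.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """All contiguous n-grams of length n_min..n_max, joined with spaces."""
    return [' '.join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def vocabulary(docs, min_df):
    """Keep the n-grams that occur in at least min_df documents."""
    doc_freq = Counter()
    for doc in docs:
        # set() so that each document counts a term at most once
        doc_freq.update(set(ngrams(doc.split())))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

# Hypothetical lemmatized reviews
docs = ['great album', 'great album ever', 'boring album']
print(vocabulary(docs, min_df=2))  # ['album', 'great', 'great album']
```

With `min_df=7` on the real corpus, the same filter keeps only unigrams and bigrams that appear in at least seven training reviews.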


Accuracy is higher (here measured on the held-out test split):

In [37]:
y_test_pred = clf.predict(vect.transform(X_test))
print('test accuracy:', accuracy_score(y_test, y_test_pred))

test accuracy: 0.8613445378151261
