1. Build a bag of words with ``sklearn.feature_extraction.text.CountVectorizer.fit_transform()``. Apply it to ``'tweet_stemmed'`` and ``'tweet_lemmatized'`` separately.
- Use ``max_df`` to ignore terms whose document frequency (the share of documents that contain the term) is strictly above the 0.9 threshold.
- Cap the number of terms in the bag with ``max_features = 1000``.
- Exclude stop words with ``stop_words='english'``.
- Display the bag-of-words model as a DataFrame. Take the column names from ``CountVectorizer.get_feature_names()`` (renamed ``get_feature_names_out()`` in scikit-learn 1.0; the old name was removed in 1.2).
2. Build a TF-IDF-weighted bag of words with ``sklearn.feature_extraction.text.TfidfVectorizer.fit_transform()``. Apply it to ``'tweet_stemmed'`` and ``'tweet_lemmatized'`` separately.
- Use ``max_df`` to ignore terms whose document frequency (the share of documents that contain the term) is strictly above the 0.9 threshold.
- Cap the number of terms in the bag with ``max_features = 1000``.
- Exclude stop words with ``stop_words='english'``.
- Display the resulting model as a DataFrame. Take the column names from ``TfidfVectorizer.get_feature_names()`` (renamed ``get_feature_names_out()`` in scikit-learn 1.0; the old name was removed in 1.2).
3. Train a ``gensim.models.Word2Vec`` model on our data.
- Train on the tokenized tweets in ``combine_df['tweet_token']``.
- Set the following parameters: ``size=200, window=5, min_count=2, sg = 1, hs = 0, negative = 10, workers= 32, seed = 34`` (in gensim 4.0 and later, ``size`` is called ``vector_size``).
- Call ``train()`` with ``total_examples`` equal to the length of ``combine_df['tweet_token']`` and ``epochs`` set to 20.
4. Let's test the Word2Vec model a little and see how it behaves. Pass ``positive = "dinner"`` to ``most_similar``, and the model returns the words from the corpus that are most similar to it. Then try the same with the word ``"trump"``.

5. The examples above show that the word2vec model does a good job of finding the words most similar to a given word. But how does it do that? It has learned a vector for every unique word in the data and uses cosine similarity to find the closest vectors (words).
    
    Let's inspect the vector representation of any word from the corpus, for example ``"food"``.


6. Because the data consists of whole tweets rather than single words, we need a way to turn the word2vec word vectors into a vector for an entire tweet. A simple solution is to take the mean of the vectors of all words in the tweet. The resulting vector has the same length as a single word vector, that is, 200. Repeating this for every tweet gives 200 ``word2vec`` features per tweet.

    Build a vector for each tweet by averaging the vectors of the words it contains.
    
    In a loop, do ``vec += model_w2v[word].reshape((1, size))`` (``model_w2v.wv[word]`` in gensim 4.0 and later), then divide the final vector by the number of words in the tweet. The result should have ``wordvec_df.shape = (49159, 200)``.
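
The averaging step in task 6 and the cosine similarity mentioned in task 5 can be sketched without a trained model. This is a minimal sketch under stated assumptions: `word_vectors` is a hypothetical dictionary standing in for the trained model's word lookup, the vector size is 4 instead of 200, and words without a vector (for example, words dropped by ``min_count``) are skipped, so the sum is divided by the number of words actually found rather than by the raw tweet length.

```python
import numpy as np

size = 4  # toy dimension for illustration; the assignment uses 200

# Hypothetical stand-in for a trained model's word -> vector lookup
word_vectors = {
    "love": np.array([1.0, 0.0, 0.0, 0.0]),
    "dinner": np.array([0.0, 1.0, 0.0, 0.0]),
    "food": np.array([0.0, 1.0, 1.0, 0.0]),
}


def tweet_vector(tokens, vectors, size):
    """Average the vectors of the tokens that have a vector."""
    vec = np.zeros((1, size))
    count = 0
    for word in tokens:
        if word in vectors:  # assumption: skip out-of-vocabulary words
            vec += vectors[word].reshape((1, size))
            count += 1
    if count:
        vec /= count
    return vec  # all zeros when no token has a vector


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors (0.0 if either is zero)."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / norm) if norm else 0.0


tweets = [["love", "dinner"], ["food", "unknown"], ["unknown"]]

# One row per tweet, one column per vector dimension
wordvec_arrays = np.vstack([tweet_vector(t, word_vectors, size) for t in tweets])
print(wordvec_arrays.shape)
print(cosine_similarity(word_vectors["dinner"], word_vectors["food"]))
```

With the real model, the same loop over all tweets would yield an array with one row per tweet and 200 columns, which can then be wrapped in a DataFrame.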



# Imports

In [61]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
import os
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

# Loading the data

In [2]:
df = pd.read_pickle('../Lesson_01/lesson_01_preprocessing.pkl')
df.head()

Unnamed: 0,id,label,tweet,clear_tweet,tweet_token,tweet_token_filtered,tweet_stemmed,tweet_lemmatized
0,1,0.0,@user when a father is dysfunctional and is s...,user when father is dysfunctional and is so se...,"[user, when, father, is, dysfunctional, and, i...","[user, father, dysfunctional, selfish, drags, ...","[user, father, dysfunct, selfish, drag, kid, d...","[user, father, dysfunct, selfish, drag, kid, d..."
1,2,0.0,@user @user thanks for #lyft credit i can't us...,user user thanks for lyft credit can use cause...,"[user, user, thanks, for, lyft, credit, can, u...","[user, user, thanks, lyft, credit, use, cause,...","[user, user, thank, lyft, credit, use, caus, o...","[user, user, thank, lyft, credit, use, caus, o..."
2,3,0.0,bihday your majesty,bihday your majesty,"[bihday, your, majesty]","[bihday, majesty]","[bihday, majesti]","[bihday, majesti]"
3,4,0.0,#model i love u take with u all the time in ...,model love take with all the time in ur,"[model, love, take, with, all, the, time, in, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]","[model, love, take, time, ur]"
4,5,0.0,factsguide: society now #motivation,factsguide society now motivation,"[factsguide, society, now, motivation]","[factsguide, society, motivation]","[factsguid, societi, motiv]","[factsguid, societi, motiv]"


# Tasks
## Build a bag of words 
with ``sklearn.feature_extraction.text.CountVectorizer.fit_transform()``. Apply it to ``'tweet_stemmed'`` and ``'tweet_lemmatized'`` separately.
- Use ``max_df`` to ignore terms whose document frequency (the share of documents that contain the term) is strictly above the 0.9 threshold.
- Cap the number of terms in the bag with ``max_features = 1000``.
- Exclude stop words with ``stop_words='english'``.
- Display the bag-of-words model as a DataFrame. Take the column names from ``CountVectorizer.get_feature_names()`` (renamed ``get_feature_names_out()`` in scikit-learn 1.0; the old name was removed in 1.2).
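
To see what these parameters do before running them on the full dataset, here is a small sketch on a hypothetical four-document corpus (the documents are made up for illustration). The token "user" occurs in every document, so its document frequency is 1.0 and ``max_df=0.9`` drops it; "with" and "the" are removed by the English stop-word list. The `getattr` fallback is only there so the sketch runs on both old and new scikit-learn versions.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus: "user" appears in every document
docs = [
    "user love dinner",
    "user hate rain",
    "user love sunshine",
    "user dinner with the family",
]

vectorizer = CountVectorizer(max_df=0.9, max_features=1000, stop_words="english")
bow = vectorizer.fit_transform(docs)

# get_feature_names_out() exists in scikit-learn >= 1.0;
# older versions only provide get_feature_names()
get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names

# Rows are documents, columns are the surviving terms, cells are raw counts
bow_df = pd.DataFrame(bow.toarray(), columns=get_names())
print(bow_df)
```

With only a handful of distinct terms, ``max_features=1000`` has no effect here; on the tweet corpus it keeps the 1000 most frequent terms.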

In [50]:
res_stemmed = [" ".join(w) for w in df['tweet_stemmed']]

count_vectorizer = CountVectorizer(ngram_range=(1, 2), 
                                   analyzer='word', 
                                   binary=False,
                                   max_df=0.9, 
                                   max_features = 1000, 
                                   stop_words='english')

# Build the bag-of-words model
bow_tweet_stemmed = count_vectorizer.fit_transform(res_stemmed)

# Display the bag-of-words model as a DataFrame
# (scikit-learn < 1.0 only has get_feature_names(), which was removed in 1.2)
feature_names_tweet_stemmed = count_vectorizer.get_feature_names_out()

feature_names_tweet_stemmed_df = pd.DataFrame(bow_tweet_stemmed.toarray(), columns = feature_names_tweet_stemmed)
feature_names_tweet_stemmed_df.head(5)

['user father dysfunct selfish drag kid dysfunct run',
 'user user thank lyft credit use caus offer wheelchair van pdx disapoint getthank']

In [57]:
res_lemmatized = [" ".join(w) for w in df['tweet_lemmatized']]

count_vectorizer = CountVectorizer(ngram_range=(1, 2), 
                                   analyzer='word', 
                                   binary=False,
                                   max_df=0.9, 
                                   max_features = 1000, 
                                   stop_words='english')

# Build the bag-of-words model
bow_tweet_lemmatized = count_vectorizer.fit_transform(res_lemmatized)

# Display the bag-of-words model as a DataFrame
# (scikit-learn < 1.0 only has get_feature_names(), which was removed in 1.2)
feature_names_tweet_lemmatized = count_vectorizer.get_feature_names_out()

feature_names_tweet_lemmatized_df = pd.DataFrame(bow_tweet_lemmatized.toarray(), 
                                                 columns = feature_names_tweet_lemmatized)
feature_names_tweet_lemmatized_df.head()

['user father dysfunct selfish drag kid dysfunct run',
 'user user thank lyft credit use caus offer wheelchair van pdx disapoint getthank']

## Build a TF-IDF-weighted bag of words 
with ``sklearn.feature_extraction.text.TfidfVectorizer.fit_transform()``.
Apply it to ``'tweet_stemmed'`` and ``'tweet_lemmatized'`` separately.
- Use ``max_df`` to ignore terms whose document frequency (the share of documents that contain the term) is strictly above the 0.9 threshold.
- Cap the number of terms in the bag with ``max_features = 1000``.
- Exclude stop words with ``stop_words='english'``.
- Display the resulting model as a DataFrame. Take the column names from ``TfidfVectorizer.get_feature_names()`` (renamed ``get_feature_names_out()`` in scikit-learn 1.0; the old name was removed in 1.2).
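
The difference between raw counts and TF-IDF weights is easier to see on a tiny example. The sketch below uses a hypothetical three-document corpus and default vectorizer settings (no ``max_df``, ``max_features``, or stop words), which is an assumption made only to keep the vocabulary visible. "love" occurs in every document, so within a document it gets a lower weight than a rarer word such as "dinner", and each row is scaled to unit L2 norm by default.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus: "love" appears in every document
docs = ["love dinner", "love rain", "love love sunshine"]


def to_frame(vectorizer, texts):
    """Fit a vectorizer and return its matrix as a labeled DataFrame."""
    matrix = vectorizer.fit_transform(texts)
    # get_feature_names_out() exists in scikit-learn >= 1.0;
    # older versions only provide get_feature_names()
    get_names = getattr(vectorizer, "get_feature_names_out", None) or vectorizer.get_feature_names
    return pd.DataFrame(matrix.toarray(), columns=get_names())


counts_df = to_frame(CountVectorizer(), docs)  # raw term counts
tfidf_df = to_frame(TfidfVectorizer(), docs)   # TF-IDF weights, rows L2-normalized

print(counts_df)
print(tfidf_df.round(3))
```

In the first document both words occur once, yet the TF-IDF weight of "dinner" is higher than that of "love", because "love" carries less information about which document it came from.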

In [62]:
res_stemmed = [" ".join(w) for w in df['tweet_stemmed']]

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), 
                                   analyzer='word', 
                                   binary=False,
                                   max_df=0.9, 
                                   max_features = 1000, 
                                   stop_words='english')

# Build the TF-IDF model
bow_tfidf_tweet_stemmed = tfidf_vectorizer.fit_transform(res_stemmed)

# Display the TF-IDF model as a DataFrame
# (scikit-learn < 1.0 only has get_feature_names(), which was removed in 1.2)
feature_names_tweet_stemmed_tfidf = tfidf_vectorizer.get_feature_names_out()

feature_names_tweet_stemmed_df_tfidf = pd.DataFrame(bow_tfidf_tweet_stemmed.toarray(), 
                                                    columns = feature_names_tweet_stemmed_tfidf)
feature_names_tweet_stemmed_df_tfidf.head(5)

Unnamed: 0,abl,absolut,accept,account,act,action,actor,actual,ad,adapt,...,ye,yeah,yeah good,year,yesterday,yo,yoga,young,youtub,yr
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [63]:
res_lemmatized = [" ".join(w) for w in df['tweet_lemmatized']]

tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), 
                                   analyzer='word', 
                                   binary=False,
                                   max_df=0.9, 
                                   max_features = 1000, 
                                   stop_words='english')

# Build the TF-IDF model
bow_tfidf_tweet_lemmatized = tfidf_vectorizer.fit_transform(res_lemmatized)

# Display the TF-IDF model as a DataFrame
# (scikit-learn < 1.0 only has get_feature_names(), which was removed in 1.2)
feature_names_tweet_lemmatized_tfidf = tfidf_vectorizer.get_feature_names_out()

feature_names_tweet_lemmatized_df_tfidf = pd.DataFrame(bow_tfidf_tweet_lemmatized.toarray(), 
                                                    columns = feature_names_tweet_lemmatized_tfidf)
feature_names_tweet_lemmatized_df_tfidf.head(5)

Unnamed: 0,abl,absolut,accept,account,act,action,actor,actual,ad,adapt,...,ye,yeah,yeah good,year,yesterday,yo,yoga,young,youtub,yr
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


## Train a ``gensim.models.Word2Vec`` model on our data.
- Train on the tokenized tweets in ``combine_df['tweet_token']``.
- Set the following parameters: ``size=200, window=5, min_count=2, sg = 1, hs = 0, negative = 10, workers= 32, seed = 34`` (in gensim 4.0 and later, ``size`` is called ``vector_size``).
- Call ``train()`` with ``total_examples`` equal to the length of ``combine_df['tweet_token']`` and ``epochs`` set to 20.