# Day 90 - CountVectorizer & TfidfVectorizer

1. The following list of text documents is given: <br>
<br>
documents = [ <br>
    'python is a programming language', <br>
    'python is popular', <br>
    'programming in python', <br>
    'object-oriented programming in python' <br>
] <br>
<br>
Vectorize the documents with the CountVectorizer class from scikit-learn. Set the stop_words argument to 'english'.

In [1]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
documents = [
    'python is a programming language',
    'python is popular',
    'programming in python',
    'object-oriented programming in python',
]
 
vectorizer = CountVectorizer(stop_words='english')
 
df = pd.DataFrame(
    data=vectorizer.fit_transform(documents).toarray(),
    columns=vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
)
print(df)

   language  object  oriented  popular  programming  python
0         1       0         0        0            1       1
1         0       0         0        1            0       1
2         0       0         0        0            1       1
3         0       1         1        0            1       1
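
To see why the table has exactly these six columns, it can help to inspect the analyzer that CountVectorizer builds. The sketch below assumes scikit-learn's default token pattern and its built-in English stop-word list; the names `analyze`, `tokens` and `vocabulary` are illustrative and not part of the exercise.

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    'python is a programming language',
    'python is popular',
    'programming in python',
    'object-oriented programming in python',
]

vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit(documents)

# build_analyzer() returns the callable that lowercases, tokenizes
# and filters stop words for a single document.
analyze = vectorizer.build_analyzer()

# The default token pattern splits 'object-oriented' at the hyphen,
# and 'in' is dropped because it is an English stop word.
tokens = analyze('object-oriented programming in python')
print(tokens)

# vocabulary_ maps each kept term to its column index;
# sorting by index gives the column order of the count matrix.
vocabulary = dict(sorted(vectorizer.vocabulary_.items(), key=lambda item: item[1]))
print(vocabulary)
```

The words 'is', 'a' and 'in' never reach the vocabulary, which is why they have no column in the output above.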




2. The following list of text documents is given: <br>
<br>
documents = [ <br>
    'python is a programming language', <br>
    'python is popular', <br>
    'programming in python', <br>
    'object-oriented programming in python', <br>
    'programming language' <br>
] <br>
<br>
Vectorize the documents with the CountVectorizer class from scikit-learn. Set the stop_words argument to 'english'. Also set the argument that controls n-gram extraction so that both unigrams and bigrams are extracted.

In [2]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
 
pd.set_option('display.width', 200)
pd.set_option('display.max_columns', 20)
documents = [
    'python is a programming language',
    'python is popular',
    'programming in python',
    'object-oriented programming in python',
    'programming language',
]
 
vectorizer = CountVectorizer(stop_words='english', ngram_range=(1, 2))
 
df = pd.DataFrame(
    data=vectorizer.fit_transform(documents).toarray(),
    columns=vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
)
print(df)

   language  object  object oriented  oriented  oriented programming  popular  programming  programming language  programming python  python  python popular  python programming
0         1       0                0         0                     0        0            1                     1                   0       1               0                   1
1         0       0                0         0                     0        1            0                     0                   0       1               1                   0
2         0       0                0         0                     0        0            1                     0                   1       1               0                   0
3         0       1                1         1                     1        0            1                     0                   1       1               0                   0
4         1       0                0         0                     0        0            1                     1                   0       0               0                   0



3. The following list of text documents is given: <br>
<br>
documents = [ <br>
    'python is a programming language', <br>
    'python is popular', <br>
    'programming in python', <br>
    'object-oriented programming in python', <br>
    'programming language' <br>
] <br>
<br>
Vectorize the given documents using the TfidfVectorizer class from scikit-learn.

In [4]:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    'python is a programming language',
    'python is popular',
    'programming in python',
    'object-oriented programming in python',
    'programming language',
]
 
tfidf_vectorizer = TfidfVectorizer()
 
df = pd.DataFrame(
    data=tfidf_vectorizer.fit_transform(documents).toarray(),
    columns=tfidf_vectorizer.get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
)
print(df)

         in        is  language    object  oriented   popular  programming    python
0  0.000000  0.579748  0.579748  0.000000  0.000000  0.000000     0.404837  0.404837
1  0.000000  0.575063  0.000000  0.000000  0.000000  0.712775     0.000000  0.401565
2  0.711525  0.000000  0.000000  0.000000  0.000000  0.000000     0.496856  0.496856
3  0.445090  0.000000  0.000000  0.551677  0.551677  0.000000     0.310805  0.310805
4  0.000000  0.000000  0.819887  0.000000  0.000000  0.000000     0.572526  0.000000
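
The weights above can be reproduced by hand, which makes the default settings of TfidfVectorizer easier to follow. The sketch below assumes the defaults (smooth_idf=True, norm='l2', sublinear_tf=False), under which idf = ln((1 + n) / (1 + df)) + 1 and every row is scaled to unit Euclidean length; the variable names are illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [
    'python is a programming language',
    'python is popular',
    'programming in python',
    'object-oriented programming in python',
    'programming language',
]

# Raw term counts with no stop-word filtering, matching TfidfVectorizer().
counts = CountVectorizer().fit_transform(documents).toarray()

n_documents = counts.shape[0]
# Number of documents in which each term occurs at least once.
document_frequency = (counts > 0).sum(axis=0)

# Smoothed inverse document frequency.
idf = np.log((1 + n_documents) / (1 + document_frequency)) + 1

# Weight the counts by idf, then scale each row to unit L2 norm.
weighted = counts * idf
manual_tfidf = weighted / np.linalg.norm(weighted, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(documents).toarray()
print(np.allclose(manual_tfidf, reference))
```

Terms that occur in fewer documents ('is', 'language') receive a larger idf than terms that occur in most of them ('python', 'programming'), which explains the larger weights in the first row.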




4. Load the clusters.csv file into a DataFrame and assign it to the variable df. <br>
Using the AgglomerativeClustering class from scikit-learn, create a model that splits the dataset into two clusters. Fit the model, predict the cluster of each sample, and store the result in a new 'cluster' column of df.<br>
Finally, print the first ten rows of df to the console.

In [5]:
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
 
df = pd.read_csv('clusters.csv')
 
# fit_predict fits the model and returns one cluster label per row;
# only the feature columns x1 and x2 are passed to the model.
cluster = AgglomerativeClustering(n_clusters=2)
df['cluster'] = cluster.fit_predict(df[['x1', 'x2']])
print(df.head(10))

         x1        x2  cluster
0 -2.486532  7.025770        0
1 -3.522549  8.578303        0
2 -2.982040  7.998514        0
3 -2.135276  6.255888        0
4  2.762504  4.210918        1
5 -3.541472  8.489106        0
6  1.240259  0.781640        1
7  0.053390  8.966770        0
8 -0.827918  6.742253        0
9  3.291716  1.296751        1
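
Since clusters.csv is not included here, the same workflow can be tried on generated data. This is a sketch under stated assumptions: make_blobs stands in for the file, the columns are named x1 and x2 as in the output above, and `demo_df` and `model` are illustrative names.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic stand-in for clusters.csv (assumption: two numeric features).
points, _ = make_blobs(n_samples=100, centers=2, cluster_std=1.0, random_state=42)
demo_df = pd.DataFrame(points, columns=['x1', 'x2'])

# AgglomerativeClustering has no separate predict method;
# fit_predict fits the model and returns the label of each training row.
model = AgglomerativeClustering(n_clusters=2)
demo_df['cluster'] = model.fit_predict(demo_df[['x1', 'x2']])

print(demo_df.head(10))
print(demo_df['cluster'].value_counts())
```

The labels returned by fit_predict are the same values stored in `model.labels_`, so either can be used to fill the 'cluster' column.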
