# Feature Extraction from text files with scikit-learn

The sklearn.feature_extraction module extracts features from datasets such as text and images, in a format supported by machine learning algorithms. Reference: http://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

## 1) Loading features from dicts - DictVectorizer
DictVectorizer turns lists of feature-value mappings (standard Python dict objects, or other dict-like objects mapping feature names to feature values) into the NumPy arrays or scipy.sparse matrices used by scikit-learn estimators.

DictVectorizer implements “one-hot” coding for categorical (aka nominal, discrete) features. Categorical features are “attribute-value” pairs where the value is restricted to a list of discrete possibilities without ordering (e.g. topic identifiers, types of objects, tags, names…).
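To make the one-hot idea concrete, here is a minimal hand-rolled sketch in plain Python. It is not how DictVectorizer is implemented; the helper name `one_hot_encode` is hypothetical, and it assumes string values are categorical while numeric values are passed through unchanged.

```python
def one_hot_encode(records, separator="="):
    """Illustrative one-hot coding for a list of dicts (not scikit-learn's code)."""
    # String values become "name=value" columns; numeric values keep their name.
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}{separator}{value}" if isinstance(value, str) else key)
    feature_names = sorted(names)
    index = {name: i for i, name in enumerate(feature_names)}

    rows = []
    for record in records:
        row = [0.0] * len(feature_names)
        for key, value in record.items():
            if isinstance(value, str):
                row[index[f"{key}{separator}{value}"]] = 1.0  # one-hot flag
            else:
                row[index[key]] = float(value)  # numeric feature copied as-is
        rows.append(row)
    return feature_names, rows


measurements = [
    {"city": "Dubai", "temperature": 33.0},
    {"city": "London", "temperature": 12.0},
    {"city": "San Francisco", "temperature": 18.0},
]
feature_names, rows = one_hot_encode(measurements)
print(feature_names)
for row in rows:
    print(row)
```

Each city gets its own 0/1 column, while the numeric temperature keeps a single column.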

In [1]:
measurements = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]

In [3]:
from sklearn.feature_extraction import DictVectorizer
dvec = DictVectorizer()

In [4]:
## Learn a list of feature name -> indices mappings and transform X.
X_train_dic_counts = dvec.fit_transform(measurements)
print(X_train_dic_counts)

  (0, 0)	1.0
  (0, 3)	33.0
  (1, 1)	1.0
  (1, 3)	12.0
  (2, 2)	1.0
  (2, 3)	18.0


In [5]:
X_train_dic_counts.todense()

matrix([[ 1.,  0.,  0., 33.],
        [ 0.,  1.,  0., 12.],
        [ 0.,  0.,  1., 18.]])

In [6]:
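## Note: get_feature_names() was removed in scikit-learn 1.2; newer versions use get_feature_names_out().
dvec.get_feature_names()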

['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

In [7]:
dvec.inverse_transform(X_train_dic_counts)

[{'city=Dubai': 1.0, 'temperature': 33.0},
 {'city=London': 1.0, 'temperature': 12.0},
 {'city=San Francisco': 1.0, 'temperature': 18.0}]

In [12]:
## Transform feature->value dicts to array or sparse matrix.
## Named features not encountered during fit or fit_transform will be ignored.
dvec.transform({'temperature': 22.0, 'city=Toronto': 1.0}).todense()

matrix([[ 0.,  0.,  0., 22.]])

### Feature Selection 

In [13]:
from sklearn.feature_selection import SelectKBest, chi2
dvec = DictVectorizer()

In [14]:
X_train_dic_counts = dvec.fit_transform(measurements)
print(X_train_dic_counts.todense())

[[ 1.  0.  0. 33.]
 [ 0.  1.  0. 12.]
 [ 0.  0.  1. 18.]]


In [20]:
## Select features according to the k highest scores (here the chi-squared statistic, k=2).
support = SelectKBest(chi2, k=2).fit(X_train_dic_counts, [0, 0, 1])

In [25]:
support.get_support()

array([False, False,  True,  True])

In [21]:
dvec.get_feature_names()

['city=Dubai', 'city=London', 'city=San Francisco', 'temperature']

In [27]:
dvec.restrict(support.get_support())

DictVectorizer(dtype=<class 'numpy.float64'>, separator='=', sort=True,
        sparse=True)

In [28]:
dvec.get_feature_names()

['city=San Francisco', 'temperature']

## 2) Bag-of-Words Model
Reference: https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/  

We may want to perform classification of documents, so each document is an “input” and a class label is the “output” for our predictive algorithm. Algorithms take vectors of numbers as input, therefore we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.
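As a rough illustration of the encoding just described, the sketch below builds a vocabulary and count vectors with the standard library only. The `tokenize` and `encode` helpers and the two toy sentences are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep runs of two or more word characters (a simplification).
    return re.findall(r"\b\w\w+\b", text.lower())


docs = ["The dog chased the cat", "The cat slept"]

# Assign each known word a fixed position by sorting the vocabulary.
vocabulary = sorted({token for doc in docs for token in tokenize(doc)})


def encode(text):
    # Count the tokens, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


vectors = [encode(doc) for doc in docs]
print(vocabulary)
print(vectors)

# Word order is discarded: both orderings give the same vector.
print(encode("the cat the dog") == encode("dog the cat the"))
```

Every document maps to a vector of the same length, regardless of how long the document is.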

There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

The scikit-learn library provides three different schemes that we can use, and we will briefly look at each.

### 2.1) Word Counts with Count Vectorizer

The CountVectorizer provides a simple way both to tokenize a collection of text documents and build a vocabulary of known words, and to encode new documents using that vocabulary.

You can use it as follows:

- Create an instance of the CountVectorizer class.
- Call the fit() function in order to learn a vocabulary from one or more documents.
- Call the transform() function on one or more documents as needed to encode each as a vector.

An encoded vector is returned with a length of the entire vocabulary and an integer count for the number of times each word appeared in the document.

Because these vectors will contain a lot of zeros, we call them sparse. SciPy provides an efficient way of handling sparse vectors in the scipy.sparse package.
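A small sketch of what sparse storage means in practice, using a made-up 3x6 count matrix: only the non-zero entries are stored, and the dense form can be recovered on demand.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy count matrix: mostly zeros, as is typical for text features.
dense = np.array([
    [0, 0, 3, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0],
])

sparse = csr_matrix(dense)
print(sparse.nnz)        # number of explicitly stored (non-zero) entries
print(sparse)            # (row, column) value triples
print(sparse.toarray())  # back to a dense NumPy array
```

This is the same `(row, column) value` layout that appears when a vectorizer's output is printed directly.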

In [89]:
from sklearn.feature_extraction.text import CountVectorizer

In [114]:
# list of text documents
ps23 = ['The LORD is my shepherd, I lack nothing. He makes me lie down in green pastures, he leads me beside quiet waters, he refreshes my soul. He guides me along the right paths for his name’s sake. Even though I walk through the darkest valley,I will fear no evil, for you are with me; your rod and your staff, they comfort me. You prepare a table before me in the presence of my enemies. You anoint my head with oil; my cup overflows. Surely your goodness and love will follow me all the days of my life, and I will dwell in the house of the LORD forever.']
# create the transform
cvec = CountVectorizer()
# tokenize and build vocab
ps23_Counts = cvec.fit_transform(ps23)

In [103]:
# summarize
print(ps23_Counts.shape)
print(cvec.vocabulary_)  ## Maps each word to its column index.
print(cvec.get_feature_names())  ## The vocabulary (the bag of words) in alphabetical order.

(1, 69)
{'the': 58, 'lord': 33, 'is': 28, 'my': 37, 'shepherd': 53, 'lack': 29, 'nothing': 40, 'he': 23, 'makes': 35, 'me': 36, 'lie': 31, 'down': 11, 'in': 27, 'green': 21, 'pastures': 44, 'leads': 30, 'beside': 6, 'quiet': 48, 'waters': 64, 'refreshes': 49, 'soul': 54, 'guides': 22, 'along': 1, 'right': 50, 'paths': 45, 'for': 18, 'his': 25, 'name': 38, 'sake': 52, 'even': 14, 'though': 60, 'walk': 63, 'through': 61, 'darkest': 9, 'valley': 62, 'will': 65, 'fear': 16, 'no': 39, 'evil': 15, 'you': 67, 'are': 4, 'with': 66, 'your': 68, 'rod': 51, 'and': 2, 'staff': 55, 'they': 59, 'comfort': 7, 'prepare': 46, 'table': 57, 'before': 5, 'presence': 47, 'of': 41, 'enemies': 13, 'anoint': 3, 'head': 24, 'oil': 42, 'cup': 8, 'overflows': 43, 'surely': 56, 'goodness': 20, 'love': 34, 'follow': 17, 'all': 0, 'days': 10, 'life': 32, 'dwell': 12, 'house': 26, 'forever': 19}
['all', 'along', 'and', 'anoint', 'are', 'before', 'beside', 'comfort', 'cup', 'darkest', 'days', 'down', 'dwell', 'enemie

In [104]:
# encode document
print(type(cvec))
print(cvec.get_feature_names())
print(ps23_Counts.todense())

<class 'sklearn.feature_extraction.text.CountVectorizer'>
['all', 'along', 'and', 'anoint', 'are', 'before', 'beside', 'comfort', 'cup', 'darkest', 'days', 'down', 'dwell', 'enemies', 'even', 'evil', 'fear', 'follow', 'for', 'forever', 'goodness', 'green', 'guides', 'he', 'head', 'his', 'house', 'in', 'is', 'lack', 'leads', 'lie', 'life', 'lord', 'love', 'makes', 'me', 'my', 'name', 'no', 'nothing', 'of', 'oil', 'overflows', 'pastures', 'paths', 'prepare', 'presence', 'quiet', 'refreshes', 'right', 'rod', 'sake', 'shepherd', 'soul', 'staff', 'surely', 'table', 'the', 'they', 'though', 'through', 'valley', 'walk', 'waters', 'will', 'with', 'you', 'your']
[[1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 4 1 1 1 3 1 1 1 1 1 2 1 1
  7 6 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 7 1 1 1 1 1 1 3 2 3 3]]


In [105]:
type(ps23_Counts.todense())

numpy.matrixlib.defmatrix.matrix

In [106]:
import numpy as np
np.max(ps23_Counts.todense())

7

In [107]:
np.where(ps23_Counts.todense() == np.max(ps23_Counts.todense()))

(array([0, 0], dtype=int64), array([36, 58], dtype=int64))

In [108]:
cvec.get_feature_names()[36], cvec.get_feature_names()[58]

('me', 'the')
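A slightly more direct way to rank words by count is to sort the indices. The sketch below uses a toy five-word excerpt (the counts for these words are copied from the output above) rather than the full matrix.

```python
import numpy as np

# Toy excerpt of the vocabulary and its counts.
feature_names = np.array(["and", "he", "me", "my", "the"])
counts = np.array([3, 4, 7, 6, 7])

# All positions that share the maximum count.
most_frequent = feature_names[np.flatnonzero(counts == counts.max())]
print(most_frequent)

# Top 3 words, highest count first; the stable sort keeps ties in vocabulary order.
top3 = np.argsort(-counts, kind="stable")[:3]
print(list(zip(feature_names[top3], counts[top3])))
```

With the real data, `counts` could be taken from `ps23_Counts.toarray()[0]`.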

#### Encode other documents that contain words that are not included in the vocabulary

In [109]:
ps34 = ['I will extol the LORD at all times; his praise will always be on my lips. I will glory in the LORD; let the afflicted hear and rejoice. Glorify the LORD with me; let us exalt his name together. I sought the LORD, and he answered me; he delivered me from all my fears. 5 Those who look to him are radiant; their faces are never covered with shame. 6 This poor man called, and the LORD heard him; he saved him out of all his troubles.']

In [115]:
ps34_Counts = cvec.transform(ps34)

In [116]:
print(cvec.get_feature_names())
print(ps34_Counts.todense())

['all', 'along', 'and', 'anoint', 'are', 'before', 'beside', 'comfort', 'cup', 'darkest', 'days', 'down', 'dwell', 'enemies', 'even', 'evil', 'fear', 'follow', 'for', 'forever', 'goodness', 'green', 'guides', 'he', 'head', 'his', 'house', 'in', 'is', 'lack', 'leads', 'lie', 'life', 'lord', 'love', 'makes', 'me', 'my', 'name', 'no', 'nothing', 'of', 'oil', 'overflows', 'pastures', 'paths', 'prepare', 'presence', 'quiet', 'refreshes', 'right', 'rod', 'sake', 'shepherd', 'soul', 'staff', 'surely', 'table', 'the', 'they', 'though', 'through', 'valley', 'walk', 'waters', 'will', 'with', 'you', 'your']
[[3 0 3 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 3 0 1 0 0 0 0 0 5 0 0
  3 2 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 3 2 0 0]]


### fit_transform() vs transform() 
- fit(raw_documents[, y]): Learn a vocabulary dictionary of all tokens in the raw documents.
- transform(raw_documents): Transform documents to a document-term matrix, using the vocabulary already learned by fit().
- fit_transform(raw_documents[, y]): Learn the vocabulary dictionary and return the document-term matrix. This is the combined step of fit() and transform().

In the previous example, we wanted the vocabulary learned from the Ps23 text to be used to build the document-term matrix for ps34. Calling fit_transform() on ps34 would relearn the vocabulary from ps34 itself, so we call only transform().
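The relationship between the three methods can be checked with a short sketch. The two toy sentences are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat sat on the mat"]

# One step: learn the vocabulary and encode the same documents.
one_step = CountVectorizer().fit_transform(docs)

# Two steps: learn the vocabulary first, then encode.
two_step_vec = CountVectorizer().fit(docs)
two_step = two_step_vec.transform(docs)

# Both routes give the same document-term matrix.
same = bool((one_step.toarray() == two_step.toarray()).all())
print(same)
print(sorted(two_step_vec.vocabulary_))
```

The two-step route matters when the documents to encode differ from the documents used to learn the vocabulary.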

### Bigram vectorizer

In [154]:
ps23 = ['ps23 is cool', 'ps34 is super cool']
bigramvect = CountVectorizer(ngram_range=(1,2), token_pattern= r'\b\w+\b', min_df=1)
ps23_bi = bigramvect.fit_transform(ps23).toarray()

In [155]:
ps23_bi

array([[1, 1, 1, 0, 1, 1, 0, 0, 0, 0],
       [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]], dtype=int64)

In [156]:
ps23_bi.shape

(2, 10)

In [157]:
print(bigramvect.vocabulary_)

{'ps23': 4, 'is': 1, 'cool': 0, 'ps23 is': 5, 'is cool': 2, 'ps34': 6, 'super': 8, 'ps34 is': 7, 'is super': 3, 'super cool': 9}


In [158]:
print(bigramvect.get_feature_names())

['cool', 'is', 'is cool', 'is super', 'ps23', 'ps23 is', 'ps34', 'ps34 is', 'super', 'super cool']


In [159]:
bigramvect.vocabulary_.get('is cool')

2

In [160]:
ps23_bi[:, 2]

array([1, 0], dtype=int64)
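To see where the ten features come from, here is a plain-Python sketch of unigram plus bigram extraction. The helper `word_ngrams` is hypothetical and splits on whitespace, which is enough for these two toy documents.

```python
def word_ngrams(tokens, min_n=1, max_n=2):
    """Return all contiguous word n-grams with min_n <= n <= max_n."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams


docs = ["ps23 is cool", "ps34 is super cool"]
per_doc = [word_ngrams(doc.split()) for doc in docs]
print(per_doc[1])

# The vocabulary is the sorted union of the n-grams of every document.
vocabulary = sorted({gram for grams in per_doc for gram in grams})
print(vocabulary)
```

The sorted union matches the feature names printed by `bigramvect` above.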

### 2.2) Word Frequencies with TfidfVectorizer
One issue with word counts is that some words like “the” will appear many times, and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency - Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternatively, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.

In [161]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [162]:
tfvec = TfidfVectorizer()

In [181]:
text = ["The quick brown fox jumped over the lazy dog.", 
        "The dog.",
        "The fox"]
text_tf = tfvec.fit_transform(text)

In [182]:
text_tf.shape

(3, 8)

In [183]:
text_tf.toarray()

array([[0.36388646, 0.27674503, 0.27674503, 0.36388646, 0.36388646,
        0.36388646, 0.36388646, 0.42983441],
       [0.        , 0.78980693, 0.        , 0.        , 0.        ,
        0.        , 0.        , 0.61335554],
       [0.        , 0.        , 0.78980693, 0.        , 0.        ,
        0.        , 0.        , 0.61335554]])

In [184]:
print(tfvec.vocabulary_)

{'the': 7, 'quick': 6, 'brown': 0, 'fox': 2, 'jumped': 3, 'over': 5, 'lazy': 4, 'dog': 1}


In [187]:
print(tfvec.idf_)

[1.69314718 1.28768207 1.28768207 1.69314718 1.69314718 1.69314718
 1.69314718 1.        ]


With the default settings, each document vector is scaled to unit Euclidean (L2) length, so the scores lie between 0 and 1, and the encoded document vectors can then be used directly with most machine learning algorithms.
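The numbers above can be reproduced by hand. The sketch below assumes TfidfVectorizer's default settings (smoothed IDF, `idf = ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row) and uses pre-tokenized copies of the three documents.

```python
import math

# The three documents from the example, already lowercased and tokenized.
docs = [
    ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"],
    ["the", "dog"],
    ["the", "fox"],
]
n_docs = len(docs)


def idf(term):
    # Smoothed inverse document frequency (scikit-learn's default form).
    df = sum(term in doc for doc in docs)
    return math.log((1 + n_docs) / (1 + df)) + 1


def tfidf_row(doc):
    # Raw term count times IDF, then scale the row to unit L2 length.
    weights = {term: doc.count(term) * idf(term) for term in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


print(idf("the"), idf("dog"), idf("brown"))
row = tfidf_row(docs[1])
print(row)
```

Because “the” occurs in every document its IDF is exactly 1, the lowest possible value, while words found in a single document get the highest weight.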

### 2.3) Hashing with HashingVectorizer
Counts and frequencies can be very useful, but one limitation of these methods is that the vocabulary can become very large.

This, in turn, will require large vectors for encoding documents and impose large requirements on memory and slow down algorithms.

A clever workaround is to use a one-way hash of words to convert them to integers. The clever part is that no vocabulary is required and you can choose an arbitrarily long fixed-length vector. A downside is that the hash is a one-way function, so there is no way to convert the encoding back to a word (which may not matter for many supervised learning tasks).

The HashingVectorizer class implements this approach: it tokenizes documents and consistently hashes each word to a column index, encoding documents as needed.
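The idea can be sketched with the standard library. This is only an illustration: it uses MD5 from `hashlib` as a stand-in hash (scikit-learn uses MurmurHash3), splits on whitespace, and the helper name `hashed_counts` is hypothetical.

```python
import hashlib


def hashed_counts(text, n_features=20):
    """Map each token to one of n_features buckets and count the hits."""
    vector = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        # Use the first 4 bytes of the digest as an integer, then wrap it.
        bucket = int.from_bytes(digest[:4], "little") % n_features
        vector[bucket] += 1
    return vector


hashed = hashed_counts("The quick brown fox jumped over the lazy dog")
print(hashed)
print(len(hashed), sum(hashed))
```

No vocabulary is stored, so the same text always yields the same vector, but different words may collide in one bucket. HashingVectorizer additionally flips the sign of some entries (`alternate_sign=True` by default) and L2-normalizes each row, which is why its output can contain negative and fractional values.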

In [188]:
from sklearn.feature_extraction.text import HashingVectorizer

In [189]:
# text document
text = ['The quick brown fox jumped over the lazy dog']

In [190]:
##Convert a collection of text documents to a matrix of token occurrences
hvec = HashingVectorizer(n_features=20)

In [196]:
text_hv = hvec.fit_transform(text).toarray()

In [198]:
text_hv

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.33333333,  0.        , -0.33333333,  0.33333333,  0.        ,
         0.        ,  0.33333333,  0.        ,  0.        ,  0.        ,
        -0.33333333,  0.        ,  0.        , -0.66666667,  0.        ]])

## 3) Examples - Bag of words

In [200]:
with open('102welcome.txt') as f:
    vocab1 = f.read().replace('\n', '')
with open('103welcome.txt') as f:
    vocab2 = f.read().replace('\n', '')

In [201]:
vocab1

'welcome to Big Data Analytics. My name is Ziyad Al-Khinali and I will be your instructor for this course. To learn more about my background, check out Meet Your Instructor, under the About the Course module. This course will have a lab coach available to assist you with the lab portion of this course and any software and application questions that you may have.  To contact the Lab Coach, use the lab coach discussion forum.  More details will follow after the start of the course.  Before delving into the course content and learning activities, it is important to be aware of the resources that will be available to you throughout the term. Use these resources to build your learning and time management strategies, so that you can meet the course expectations.Review the Getting Started module to:learn how to navigate your course (A2L Guide)learn how to avoid falling into plagiarism (Academic Integrity Guide)identify library and research resources (Library, Research and Writing Assistance)R

In [202]:
train_set = [vocab1, vocab2]
print(train_set)

['welcome to Big Data Analytics. My name is Ziyad Al-Khinali and I will be your instructor for this course. To learn more about my background, check out Meet Your Instructor, under the About the Course module. This course will have a lab coach available to assist you with the lab portion of this course and any software and application questions that you may have.  To contact the Lab Coach, use the lab coach discussion forum.  More details will follow after the start of the course.  Before delving into the course content and learning activities, it is important to be aware of the resources that will be available to you throughout the term. Use these resources to build your learning and time management strategies, so that you can meet the course expectations.Review the Getting Started module to:learn how to navigate your course (A2L Guide)learn how to avoid falling into plagiarism (Academic Integrity Guide)identify library and research resources (Library, Research and Writing Assistance)

## Tokenizing text with scikit-learn

In [203]:
from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer(stop_words='english')  ## Convert a collection of text documents to a matrix of token counts
X_train_counts = vec.fit_transform(train_set)  ## Document-term matrix (rows are documents, columns are terms)
print(X_train_counts.toarray())  ## Dense representation of the sparse matrix

[[ 0  1  1  1  1  0  1  1  1  1  1  0  2  1  1  1  0  1  1  1  1  1  3  1
   1  2  1 12  1  1  1  1  1  1  2  0  1  1  1  1  1  1  1  2  1  1  1  2
   1  1  1  1  1  2  1  1  1  0  1  1  4  4  3  2  1  1  1  2  1  3  1  3
   1  1  1  1  1  1  1  1  1  2  2  3  2  1  0  0  1  1  1  1  1  2  2  3
   2  1  1  2]
 [ 1  1  1  1  1  1  0  0  1  0  1  1  1  1  1  2  1  0  1  1  2  1  3  1
   1  1  1 10  1  1  1  1  0  1  1  2  1  1  1  1  1  1  0  1  1  0  1  2
   1  1  1  1  1  2  1  1  1  1  1  0  4  5  3  2  1  1  2  0  1  4  1  4
   1  1  1  1  1  1  1  1  1  0  2  3  2  1  1  1  1  0  1  1  1  2  2  3
   1  2  1  0]]


In [204]:
print(X_train_counts.shape)  ## There are 2 documents and 100 distinct terms in the vocabulary

(2, 100)


In [205]:
print(vec.get_feature_names()) ## Array mapping from feature integer indices to feature name
## This lists the list of words aka bag of words in alphabetical order.

['103', 'a2l', 'academic', 'activities', 'advantage', 'akash', 'al', 'analytics', 'application', 'assist', 'assistance', 'assisting', 'available', 'avoid', 'aware', 'background', 'bda', 'big', 'build', 'cce', 'check', 'class', 'coach', 'community', 'connect', 'contact', 'content', 'course', 'data', 'dates', 'delving', 'detailed', 'details', 'discuss', 'discussion', 'eleanor', 'encourage', 'exchange', 'expectations', 'experience', 'experiences', 'falling', 'follow', 'forum', 'forward', 'gather', 'getting', 'guide', 'guiding', 'help', 'identify', 'important', 'information', 'instructor', 'integrity', 'interests', 'introduce', 'introductions', 'iâ', 'khinali', 'lab', 'learn', 'learning', 'library', 'looking', 'make', 'management', 'materials', 'mcmasterâ', 'meet', 'meeting', 'module', 'navigate', 'need', 'networking', 'opportunity', 'organize', 'plagiarism', 'portion', 'questions', 'reach', 'required', 'research', 'resources', 'review', 'schedule', 'shetty', 'smith', 'software', 'start', 

In [206]:
print(vec.vocabulary_)  ## vocabulary_ : dict mapping each term to its feature index
## For instance, 'welcome' has the index 97.
## Note: there are 100 terms in total.

{'welcome': 97, 'big': 17, 'data': 28, 'analytics': 7, 'ziyad': 99, 'al': 6, 'khinali': 59, 'instructor': 53, 'course': 27, 'learn': 61, 'background': 15, 'check': 20, 'meet': 69, 'module': 71, 'lab': 60, 'coach': 22, 'available': 12, 'assist': 9, 'portion': 78, 'software': 88, 'application': 8, 'questions': 79, 'contact': 25, 'use': 96, 'discussion': 34, 'forum': 43, 'details': 32, 'follow': 42, 'start': 89, 'delving': 30, 'content': 26, 'learning': 62, 'activities': 3, 'important': 51, 'aware': 14, 'resources': 83, 'term': 94, 'build': 18, 'time': 95, 'management': 66, 'strategies': 91, 'expectations': 38, 'review': 84, 'getting': 46, 'started': 90, 'navigate': 72, 'a2l': 1, 'guide': 47, 'avoid': 13, 'falling': 41, 'plagiarism': 77, 'academic': 2, 'integrity': 54, 'identify': 50, 'library': 63, 'research': 82, 'writing': 98, 'assistance': 10, 'organize': 76, 'study': 93, 'dates': 29, 'detailed': 31, 'schedule': 85, 'gather': 45, 'required': 81, 'materials': 67, 'reach': 80, 'need': 7

In [207]:
len(vec.get_feature_names()), len(vec.vocabulary_)

(100, 100)

In [208]:
test_set = train_set

Next, we encode the test set using the vocabulary learned from the training set:

In [209]:
X_test_counts = vec.transform(test_set)
print(X_test_counts.toarray())

[[ 0  1  1  1  1  0  1  1  1  1  1  0  2  1  1  1  0  1  1  1  1  1  3  1
   1  2  1 12  1  1  1  1  1  1  2  0  1  1  1  1  1  1  1  2  1  1  1  2
   1  1  1  1  1  2  1  1  1  0  1  1  4  4  3  2  1  1  1  2  1  3  1  3
   1  1  1  1  1  1  1  1  1  2  2  3  2  1  0  0  1  1  1  1  1  2  2  3
   2  1  1  2]
 [ 1  1  1  1  1  1  0  0  1  0  1  1  1  1  1  2  1  0  1  1  2  1  3  1
   1  1  1 10  1  1  1  1  0  1  1  2  1  1  1  1  1  1  0  1  1  0  1  2
   1  1  1  1  1  2  1  1  1  1  1  0  4  5  3  2  1  1  2  0  1  4  1  4
   1  1  1  1  1  1  1  1  1  0  2  3  2  1  1  1  1  0  1  1  1  2  2  3
   1  2  1  0]]


### Sum 2 rows and build a frequency table 

In [210]:
import pandas as pd
df = pd.DataFrame(X_test_counts.toarray(), columns=vec.get_feature_names())
df.head()

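|   | 103 | a2l | academic | activities | advantage | akash | al | analytics | application | assist | ... | started | strategies | student | study | term | time | use | welcome | writing | ziyad |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 2 | 2 | 3 | 2 | 1 | 1 | 2 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | ... | 1 | 1 | 1 | 2 | 2 | 3 | 1 | 2 | 1 | 0 |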


In [214]:
sumdf = df.sum(axis=0)

In [215]:
type(sumdf)

pandas.core.series.Series

In [216]:
pd.DataFrame(sumdf)

|   | 0 |
|---|---|
| 103 | 1 |
| a2l | 2 |
| academic | 2 |
| activities | 2 |
| advantage | 2 |
| akash | 1 |
| al | 1 |
| analytics | 1 |
| application | 2 |
| assist | 1 |


In [218]:
df2 = pd.DataFrame({'Vocab': sumdf.index, 'Frequency': sumdf.values})
df2

|   | Vocab | Frequency |
|---|---|---|
| 0 | 103 | 1 |
| 1 | a2l | 2 |
| 2 | academic | 2 |
| 3 | activities | 2 |
| 4 | advantage | 2 |
| 5 | akash | 1 |
| 6 | al | 1 |
| 7 | analytics | 1 |
| 8 | application | 2 |
| 9 | assist | 1 |
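A frequency table is usually more useful when sorted. The sketch below builds the same kind of table in one chained expression and orders it by descending frequency; the four columns are a toy subset chosen for illustration.

```python
import pandas as pd

# Toy document-term counts: one row per document, one column per word.
counts = pd.DataFrame(
    [[0, 1, 12, 4], [1, 1, 10, 5]],
    columns=["103", "a2l", "course", "learn"],
)

freq = (
    counts.sum(axis=0)               # total count of each word over all documents
    .rename_axis("Vocab")            # name the index so it becomes a column
    .reset_index(name="Frequency")   # Series -> two-column DataFrame
    .sort_values("Frequency", ascending=False, ignore_index=True)
)
print(freq)
```

With the real data, `counts` would be the `df` built from `X_test_counts` above.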


In [219]:
train1 = [vocab1]
test1 = [vocab2]

In [221]:
cvec = CountVectorizer(stop_words='english')
X_train1_counts = cvec.fit_transform(train1)
print(X_train1_counts.toarray())

[[ 1  1  1  1  1  1  1  1  1  2  1  1  1  1  1  1  1  1  3  1  1  2  1 12
   1  1  1  1  1  1  2  1  1  1  1  1  1  1  2  1  1  1  2  1  1  1  1  1
   2  1  1  1  1  1  4  4  3  2  1  1  1  2  1  3  1  3  1  1  1  1  1  1
   1  1  1  2  2  3  2  1  1  1  1  1  1  2  2  3  2  1  1  2]]


In [222]:
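## Note: this inspects `vec` (fit on both documents above), not `cvec` (fit on train1 only); use cvec.vocabulary_ for the train1 vocabulary.
print(vec.vocabulary_)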

{'welcome': 97, 'big': 17, 'data': 28, 'analytics': 7, 'ziyad': 99, 'al': 6, 'khinali': 59, 'instructor': 53, 'course': 27, 'learn': 61, 'background': 15, 'check': 20, 'meet': 69, 'module': 71, 'lab': 60, 'coach': 22, 'available': 12, 'assist': 9, 'portion': 78, 'software': 88, 'application': 8, 'questions': 79, 'contact': 25, 'use': 96, 'discussion': 34, 'forum': 43, 'details': 32, 'follow': 42, 'start': 89, 'delving': 30, 'content': 26, 'learning': 62, 'activities': 3, 'important': 51, 'aware': 14, 'resources': 83, 'term': 94, 'build': 18, 'time': 95, 'management': 66, 'strategies': 91, 'expectations': 38, 'review': 84, 'getting': 46, 'started': 90, 'navigate': 72, 'a2l': 1, 'guide': 47, 'avoid': 13, 'falling': 41, 'plagiarism': 77, 'academic': 2, 'integrity': 54, 'identify': 50, 'library': 63, 'research': 82, 'writing': 98, 'assistance': 10, 'organize': 76, 'study': 93, 'dates': 29, 'detailed': 31, 'schedule': 85, 'gather': 45, 'required': 81, 'materials': 67, 'reach': 80, 'need': 7

In [224]:
print(vec.get_feature_names())

['103', 'a2l', 'academic', 'activities', 'advantage', 'akash', 'al', 'analytics', 'application', 'assist', 'assistance', 'assisting', 'available', 'avoid', 'aware', 'background', 'bda', 'big', 'build', 'cce', 'check', 'class', 'coach', 'community', 'connect', 'contact', 'content', 'course', 'data', 'dates', 'delving', 'detailed', 'details', 'discuss', 'discussion', 'eleanor', 'encourage', 'exchange', 'expectations', 'experience', 'experiences', 'falling', 'follow', 'forum', 'forward', 'gather', 'getting', 'guide', 'guiding', 'help', 'identify', 'important', 'information', 'instructor', 'integrity', 'interests', 'introduce', 'introductions', 'iâ', 'khinali', 'lab', 'learn', 'learning', 'library', 'looking', 'make', 'management', 'materials', 'mcmasterâ', 'meet', 'meeting', 'module', 'navigate', 'need', 'networking', 'opportunity', 'organize', 'plagiarism', 'portion', 'questions', 'reach', 'required', 'research', 'resources', 'review', 'schedule', 'shetty', 'smith', 'software', 'start', 

In [225]:
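## Note: `vec` was fit on both documents, so this reproduces the second row of X_test_counts; cvec.transform(test1) would encode test1 with the train1-only vocabulary.
X_test1_counts = vec.transform(test1)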
print(X_test1_counts.toarray())

[[ 1  1  1  1  1  1  0  0  1  0  1  1  1  1  1  2  1  0  1  1  2  1  3  1
   1  1  1 10  1  1  1  1  0  1  1  2  1  1  1  1  1  1  0  1  1  0  1  2
   1  1  1  1  1  2  1  1  1  1  1  0  4  5  3  2  1  1  2  0  1  4  1  4
   1  1  1  1  1  1  1  1  1  0  2  3  2  1  1  1  1  0  1  1  1  2  2  3
   1  2  1  0]]
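For comparison, here is a minimal sketch of a train/test split where the vocabulary is learned from the training text only. The two short sentences are made up; the point is that a word absent from the training text (“statistics”) has no column and is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ["welcome to the data analytics course"]
test_docs = ["welcome to the statistics course"]

# Learn the vocabulary from the training document only.
cvec = CountVectorizer(stop_words="english")
cvec.fit(train_docs)
vocab = sorted(cvec.vocabulary_)
print(vocab)

# Encode the test document with that vocabulary.
X_test = cvec.transform(test_docs).toarray()
print(X_test)
```

The test vector has one column per training word, so its width is fixed by the training vocabulary no matter what the test text contains.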
