A vector is a data structure similar to a list or an array. The number of values it holds is the vector's dimensionality: the minimum number of coordinates needed to specify a point in a space. In NumPy, the terms dimension, axis, and shape are closely related: an array's shape lists the length of each axis (dimension).

In [1]:
import numpy as np
a = np.array([[1,2],[3,4]])
a.shape

(2, 2)
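
Note that the array above has two axes, so it is a matrix rather than a single vector. The following is a minimal sketch in plain Python of how shape relates to nesting. The helper `shape_of` is hypothetical, written only for illustration, and it assumes a regular (non-ragged) nested list.

```python
def shape_of(nested):
    """Return the shape of a regular nested list as a tuple (assumes no ragged rows)."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0] if nested else None
    return tuple(shape)

vector = [1, 2, 3]           # one axis holding three values: a 3-dimensional vector
matrix = [[1, 2], [3, 4]]    # two axes of length 2, like the NumPy array above

vector_shape = shape_of(vector)
matrix_shape = shape_of(matrix)

# The length of the shape tuple is the number of axes.
print(vector_shape, len(vector_shape))
print(matrix_shape, len(matrix_shape))
```

The vector has one axis but three coordinates, which is why "dimension" can mean either the number of axes of an array or the number of values in a vector, depending on context.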

**n-grams** offer a statistical approach to text classification. As an example, we classify texts as being about either cricket or football.

Steps:
1. Text Vectorization.
2. Train a logistic regression model on the vectorized texts (the model learns the relationship between the word counts and the label).
3. Check importance of each word.
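
Before vectorizing, it may help to see what an n-gram is. The sketch below is a plain-Python illustration, not how scikit-learn implements it. The helper name `ngrams` and the whitespace tokenization are assumptions made for this example.

```python
def ngrams(tokens, n):
    """Return all contiguous runs of n tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Simple whitespace tokenization (an assumption for this sketch).
tokens = "football is a sport where teams score goals".split()

unigrams = ngrams(tokens, 1)   # single words
bigrams = ngrams(tokens, 2)    # pairs of adjacent words

print(unigrams)
print(bigrams)
```

Unigrams ignore word order entirely; bigrams keep a little local context, such as "score goals".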

In [2]:
# Text vectorization can be done in multiple ways; here we count the number of occurrences of each word in the texts

# pandas displays tables / data frames conveniently
import pandas as pd

# vectorize texts by counting the occurrences of each word (bag of words, BoW)
from sklearn.feature_extraction.text import CountVectorizer

# classifier trained on the count vectors
from sklearn.linear_model import LogisticRegression

In [3]:
texts = [
    "cricket is a team sport which is played on a 22 yard pitch",
    "football is a sport where teams score goals"
]
labels = [1, 0] # 1 means cricket, 0 means football

# fit vectorizer on texts
vectorizer = CountVectorizer(ngram_range=(1, 1)) # consider only single words (unigrams)
vectorizer.fit(texts) # build the n-gram vocabulary

# transform texts into vectors with the transform method
ngrams = vectorizer.transform(texts)
ngrams.todense() # convert the sparse result to a dense matrix for display (dense uses more memory)

matrix([[1, 1, 0, 0, 2, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1],
        [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0]])

In [4]:
#which word corresponds to which column?

# show the vocabulary learned by the vectorizer
vectorizer.vocabulary_

{'cricket': 1,
 'is': 4,
 'team': 10,
 'sport': 9,
 'which': 13,
 'played': 7,
 'on': 5,
 '22': 0,
 'yard': 14,
 'pitch': 6,
 'football': 2,
 'where': 12,
 'teams': 11,
 'score': 8,
 'goals': 3}
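
The word "a" is missing from the vocabulary because `CountVectorizer` lowercases the text and, with its default `token_pattern`, keeps only tokens of two or more word characters. The sketch below reproduces the counts with the standard library to make that behaviour visible. It is an illustration, not scikit-learn's implementation, and the helper `tokenize` is a hypothetical name.

```python
import re
from collections import Counter

texts = [
    "cricket is a team sport which is played on a 22 yard pitch",
    "football is a sport where teams score goals",
]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters,
    # mirroring CountVectorizer's default token_pattern r"(?u)\b\w\w+\b".
    return re.findall(r"\b\w\w+\b", text.lower())

counts = [Counter(tokenize(t)) for t in texts]

# Columns are ordered alphabetically, as in vocabulary_ above.
vocabulary = sorted(set().union(*counts))
vectors = [[c[word] for word in vocabulary] for c in counts]

print(vocabulary)
print(vectors)
```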

In [5]:
# create a pandas dataframe that shows the unigrams in each text

keys_values_sorted = sorted(list(vectorizer.vocabulary_.items()), key=lambda t: t[1])
keys_sorted = list(zip(*keys_values_sorted))[0]
ngrams_matrix = ngrams.todense()
df = pd.DataFrame(ngrams_matrix, columns=keys_sorted)
df

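|   | 22 | cricket | football | goals | is | on | pitch | played | score | sport | team | teams | where | which | yard |
|---|----|---------|----------|-------|----|----|-------|--------|-------|-------|------|-------|-------|-------|------|
| 0 | 1 | 1 | 0 | 0 | 2 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |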


In [8]:
# train logistic regression on unigrams
model = LogisticRegression()
model.fit(ngrams, labels)

# show logistic regression weights - weights decide the importance of a feature
from_unigram_to_weight = dict(zip(keys_sorted, model.coef_[0]))
from_unigram_to_weight

{'22': 0.19896903220286755,
 'cricket': 0.19896903220286755,
 'football': -0.19896470102742486,
 'goals': -0.19896470102742486,
 'is': 0.1989733633783102,
 'on': 0.19896903220286755,
 'pitch': 0.19896903220286755,
 'played': 0.19896903220286755,
 'score': -0.19896470102742486,
 'sport': 4.331175442672891e-06,
 'team': 0.19896903220286755,
 'teams': -0.19896470102742486,
 'where': -0.19896470102742486,
 'which': 0.19896903220286755,
 'yard': 0.19896903220286755}
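
To see how such weights turn into a prediction, here is a minimal sketch of the logistic regression formula: the probability of the positive class is the sigmoid of the weighted sum of the word counts. The weights are rounded from the output above. The intercept is not shown there, so it is assumed to be 0.0 purely for illustration, and `predict_proba` here is a hypothetical helper, not the scikit-learn method.

```python
import math

# Weights rounded from the printed output; "sport" is close to zero
# because it occurs once in both texts.
weights = {
    "22": 0.199, "cricket": 0.199, "football": -0.199, "goals": -0.199,
    "is": 0.199, "on": 0.199, "pitch": 0.199, "played": 0.199,
    "score": -0.199, "sport": 0.0, "team": 0.199, "teams": -0.199,
    "where": -0.199, "which": 0.199, "yard": 0.199,
}
intercept = 0.0  # ASSUMPTION: the real intercept is not shown in the notebook

def predict_proba(word_counts):
    """Probability of label 1 (cricket) for a dict of word -> count."""
    z = intercept + sum(weights.get(w, 0.0) * c for w, c in word_counts.items())
    return 1.0 / (1.0 + math.exp(-z))

cricket_counts = {"cricket": 1, "is": 2, "team": 1, "sport": 1, "which": 1,
                  "played": 1, "on": 1, "22": 1, "yard": 1, "pitch": 1}
football_counts = {"football": 1, "is": 1, "sport": 1, "where": 1,
                   "teams": 1, "score": 1, "goals": 1}

p_cricket = predict_proba(cricket_counts)
p_football = predict_proba(football_counts)
print(p_cricket, p_football)
```

Positive weights push the probability towards cricket (label 1) and negative weights towards football (label 0), which is why the weights can be read as a rough measure of each word's importance.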

We may want to treat “team” and “teams” as the same word. This normalization step is typically done by **stemming** or **lemmatization**.

In [17]:
#Stemming - reducing words to base or root form (cats -> cat, achieve -> achiev)

import nltk

#run only once
#nltk.download('punkt')

from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer #performs suffix stripping

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [10]:
stemmer = PorterStemmer()

In [11]:
print(stemmer.stem("cat"))
print(stemmer.stem("cats"))

print(stemmer.stem("walking"))
print(stemmer.stem("walked"))

print(stemmer.stem("achieve"))

print(stemmer.stem("am"))
print(stemmer.stem("is"))
print(stemmer.stem("are"))

cat
cat
walk
walk
achiev
am
is
are


In [12]:
text = "football is a sport where teams score goals"

tokens = word_tokenize(text)
tokens_stemmed = [stemmer.stem(token) for token in tokens]
print(tokens_stemmed)

['footbal', 'is', 'a', 'sport', 'where', 'team', 'score', 'goal']
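
Stemming before vectorization shrinks the vocabulary, because "team" and "teams" end up in the same column. The sketch below illustrates the effect with a toy rule. `naive_stem` is a hypothetical helper and is much cruder than the Porter algorithm used above.

```python
def naive_stem(token):
    # Toy rule, NOT the Porter algorithm: drop a trailing "s" from longer words.
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

texts = [
    "cricket is a team sport which is played on a 22 yard pitch",
    "football is a sport where teams score goals",
]

# Vocabulary without and with the toy stemming step (whitespace tokens).
raw_vocab = sorted({tok for t in texts for tok in t.split()})
stemmed_vocab = sorted({naive_stem(tok) for t in texts for tok in t.split()})

print(len(raw_vocab), raw_vocab)
print(len(stemmed_vocab), stemmed_vocab)
```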


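Stemming can produce strings that are not real words (e.g. “footbal”, “achiev”), because it operates on a single word without any knowledge of context. Lemmatization instead returns a valid dictionary form (the lemma) and can take context, such as the part of speech, into account.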

In [16]:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()

#run only once
#nltk.download('wordnet')


print(lemmatizer.lemmatize("achieve"))

achieve
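
A lemmatizer can be thought of as a dictionary lookup keyed on the word and its part of speech. The sketch below uses a tiny hand-written table to show why the part of speech matters. `LEMMAS` and `toy_lemmatize` are hypothetical names for illustration; a real lemmatizer such as `WordNetLemmatizer` relies on a full lexical database, and its `lemmatize` method likewise accepts a `pos` argument that defaults to noun.

```python
# Tiny hand-written lemma table (hypothetical, for illustration only).
LEMMAS = {
    ("am", "v"): "be", ("is", "v"): "be", ("are", "v"): "be",
    ("walked", "v"): "walk", ("walking", "v"): "walk",
    ("cats", "n"): "cat", ("teams", "n"): "team",
}

def toy_lemmatize(word, pos="n"):
    """Look up (word, pos); fall back to the word itself when it is unknown."""
    return LEMMAS.get((word.lower(), pos), word)

as_noun = toy_lemmatize("are")            # treated as a noun: no entry, unchanged
as_verb = toy_lemmatize("are", pos="v")   # treated as a verb: maps to its lemma

print(as_noun, as_verb)
print(toy_lemmatize("teams"))
```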


In [18]:
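# Stop words are very common words (e.g. "i", "the", "is") that are often removed before vectorization
#run once
#nltk.download('stopwords')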

from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
print(f"There are {len(english_stopwords)} stopwords in English")
print(english_stopwords[:10])

There are 179 stopwords in English
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
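
Once a stop word list is available, removing stop words is a simple filter over the tokens. The sketch below uses a small hand-picked set so that it runs without any download. `STOPWORDS` here is an assumption made for this example and is far shorter than NLTK's list.

```python
# Small hand-picked stop word set (an assumption for this sketch).
STOPWORDS = {"is", "a", "on", "which", "where"}

text = "cricket is a team sport which is played on a 22 yard pitch"

# Keep only the tokens that are not stop words.
content_words = [tok for tok in text.split() if tok not in STOPWORDS]
print(content_words)
```

The remaining tokens carry most of the topical signal, which is why stop words are often dropped before counting n-grams.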


**POS tagging** (part-of-speech tagging) assigns a grammatical category to every word in a text. It is used to understand the grammatical structure of a sentence and to disambiguate words that have multiple meanings.

In [23]:
#run once
#nltk.download('averaged_perceptron_tagger')

text = word_tokenize("They refuse to play football")
print(nltk.pos_tag(text))

text = word_tokenize("We need the refuse the permit")
print(nltk.pos_tag(text))


[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('play', 'VB'), ('football', 'NN')]
[('We', 'PRP'), ('need', 'VBP'), ('the', 'DT'), ('refuse', 'NN'), ('the', 'DT'), ('permit', 'NN')]




| Tag | Meaning | English Examples |
|-----|---------|------------------|
| ADJ | adjective | new, good, high, special, big, local |
| ADP | adposition | on, of, at, with, by, into, under |
| ADV | adverb | really, already, still, early, now |
| CONJ | conjunction | and, or, but, if, while, although |
| DET | determiner, article | the, a, some, most, every, no, which |
| NOUN | noun | year, home, costs, time, Africa |
| NUM | numeral | twenty-four, fourth, 1991, 14:24 |
| PRT | particle | at, on, out, over per, that, up, with |
| PRON | pronoun | he, their, her, its, my, I, us |
| VERB | verb | is, say, told, given, playing, would |
| . | punctuation marks | . , ; ! |
| X | other | ersatz, esprit, dunno, gr8, univeristy |


https://www.nltk.org/book/ch05.html
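
The table above lists coarse tags (NOUN, VERB, ...), while the `pos_tag` output earlier uses finer-grained Penn Treebank-style tags (PRP, VBP, ...). The sketch below converts the first tagged sentence to the coarse tags. `PTB_TO_UNIVERSAL` is a hand-written subset covering only the tags seen above, and it is assumed to agree with the mapping NLTK ships. NLTK's `pos_tag` also accepts a `tagset` argument for this purpose.

```python
# Hand-written subset of a fine-to-coarse tag mapping (an assumption,
# limited to the tags that appear in the output above).
PTB_TO_UNIVERSAL = {
    "PRP": "PRON",   # personal pronoun
    "VBP": "VERB",   # verb, non-3rd person singular present
    "VB": "VERB",    # verb, base form
    "TO": "PRT",     # "to"
    "NN": "NOUN",    # noun, singular
    "DT": "DET",     # determiner
}

tagged = [("They", "PRP"), ("refuse", "VBP"), ("to", "TO"),
          ("play", "VB"), ("football", "NN")]

# Tags missing from the subset fall back to "X" (other).
universal = [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged]
print(universal)
```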


# Example - Classify Medium Articles


In [None]:
!pip install huggingface_hub


In [25]:
from huggingface_hub import hf_hub_download

import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

In [26]:
df_articles = pd.read_csv(
  hf_hub_download("fabiochiu/medium-articles", repo_type="dataset",
                  filename="medium_articles.csv")
)

df_articles.head()

Unnamed: 0,title,text,url,authors,timestamp,tags
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci..."
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P..."
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We..."
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P..."
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology..."


We’ll train a classifier to predict whether an article has the **Data Science** tag.

In [27]:
# create the is_data_science and full_text columns

# full_text: concatenation of the title and the text of the article
# is_data_science: a boolean which is True if the article has the "Data Science" tag
# note: read from CSV, "tags" holds the string form of a list, so `in` is a
# substring test here (it would also match a longer tag containing "Data Science")

df_articles["is_data_science"] = df_articles["tags"] \
  .apply(lambda tags_list: "Data Science" in tags_list)
df_articles["full_text"] = df_articles["title"] + " " + df_articles["text"]
df_articles.head()

Unnamed: 0,title,text,url,authors,timestamp,tags,is_data_science,full_text
0,Mental Note Vol. 24,Photo by Josh Riemer on Unsplash\n\nMerry Chri...,https://medium.com/invisible-illness/mental-no...,['Ryan Fan'],2020-12-26 03:38:10.479000+00:00,"['Mental Health', 'Health', 'Psychology', 'Sci...",False,Mental Note Vol. 24 Photo by Josh Riemer on Un...
1,Your Brain On Coronavirus,Your Brain On Coronavirus\n\nA guide to the cu...,https://medium.com/age-of-awareness/how-the-pa...,['Simon Spichak'],2020-09-23 22:10:17.126000+00:00,"['Mental Health', 'Coronavirus', 'Science', 'P...",False,Your Brain On Coronavirus Your Brain On Corona...
2,Mind Your Nose,Mind Your Nose\n\nHow smell training can chang...,https://medium.com/neodotlife/mind-your-nose-f...,[],2020-10-10 20:17:37.132000+00:00,"['Biotechnology', 'Neuroscience', 'Brain', 'We...",False,Mind Your Nose Mind Your Nose\n\nHow smell tra...
3,The 4 Purposes of Dreams,Passionate about the synergy between science a...,https://medium.com/science-for-real/the-4-purp...,['Eshan Samaranayake'],2020-12-21 16:05:19.524000+00:00,"['Health', 'Neuroscience', 'Mental Health', 'P...",False,The 4 Purposes of Dreams Passionate about the ...
4,Surviving a Rod Through the Head,"You’ve heard of him, haven’t you? Phineas Gage...",https://medium.com/live-your-life-on-purpose/s...,['Rishav Sinha'],2020-02-26 00:01:01.576000+00:00,"['Brain', 'Health', 'Development', 'Psychology...",False,Surviving a Rod Through the Head You’ve heard ...


In [34]:
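# note: `df` is the balanced sample built in cell In [28] below; the execution count shows this cell was run after it
filtered_df = df[df['is_data_science']]
filtered_df.head()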

Unnamed: 0,title,text,url,authors,timestamp,tags,is_data_science,full_text
54778,Is bigger also smarter? — Open AI releases GPT...,Is bigger also smarter? — Open AI releases GPT...,https://towardsdatascience.com/is-bigger-also-...,['Andreas Stöckl'],2020-06-01 08:00:27.307000+00:00,"['Deep Learning', 'Machine Learning', 'Data Sc...",True,Is bigger also smarter? — Open AI releases GPT...
112622,Interview with a Data Scientist about Data Jou...,"Can you please just start by saying your name,...",https://medium.com/@duncan.kg.anderson/intervi...,['Duncan Anderson'],2020-10-08 21:42:28.931000+00:00,"['Journalism', 'Algorithms', 'Data Science', '...",True,Interview with a Data Scientist about Data Jou...
31053,Comparing AutoML/Non-Auto-ML Multi-Classificat...,Auto-ML approach:\n\nLet’s get the dataset:\n\...,https://medium.com/swlh/comparing-automl-non-a...,['Zeineb Ghrib'],2020-10-10 08:47:18.680000+00:00,"['Scikit Learn', 'Automl', 'Machine Learning',...",True,Comparing AutoML/Non-Auto-ML Multi-Classificat...
84657,The first step towards Data Science,"Hello guys, I am sure that you are ready to di...",https://medium.com/@deeppatel23/first-step-tow...,['Deep Patel'],2021-02-13 09:04:31.841000+00:00,"['Data Science', 'Deep Learning', 'Beginner', ...",True,The first step towards Data Science Hello guys...
183143,First Hackathon: My Experience,A month ago I participated in my first hackath...,https://medium.com/@dominika2465j/first-hackat...,['Dominika Jones'],2021-04-09 23:47:47.021000+00:00,"['Python Flask', 'Codingbootcamp', 'Hackathons...",True,First Hackathon: My Experience A month ago I p...


In [28]:
# build a balanced dataset: 1000 random articles with the tag and 1000 without
df = pd.concat([
    df_articles[df_articles["is_data_science"]].sample(n=1000),
    df_articles[~df_articles["is_data_science"]].sample(n=1000)
])

In [29]:
# train/test split
X = df[["full_text"]]
y = df["is_data_science"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42, stratify=y)
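
The `stratify=y` argument keeps the class proportions the same in the train and test sets. The sketch below shows the idea in plain Python. `stratified_split` is a hypothetical helper written for illustration and is not how scikit-learn implements `train_test_split`.

```python
import random

def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), taking the same share from every class."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                      # random order within the class
        n_test = round(len(indices) * test_size)  # this class's share of the test set
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Same class balance as the sampled data frame: 1000 True and 1000 False.
labels = [True] * 1000 + [False] * 1000
train_idx, test_idx = stratified_split(labels)

n_test_true = sum(labels[i] for i in test_idx)
print(len(train_idx), len(test_idx), n_test_true)
```

With a balanced dataset and `test_size=0.2`, each class contributes 200 examples to the test set, which is the per-class support a classification report would show.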

In [None]:
#model training

# fit vectorizer, vectorize train set, and train the classification model
vectorizer = CountVectorizer(ngram_range=(1, 1))
full_texts_vectorized = vectorizer.fit_transform(X_train["full_text"])
model = LogisticRegression()
model.fit(full_texts_vectorized, y_train)

In [32]:
# vectorize test set and predict
full_texts_vectorized = vectorizer.transform(X_test["full_text"])
predictions = model.predict(full_texts_vectorized)

In [33]:
# print precision, recall, f1-score on the test set
print(classification_report(y_test, predictions))

              precision    recall  f1-score   support

       False       0.88      0.93      0.91       200
        True       0.93      0.88      0.90       200

    accuracy                           0.90       400
   macro avg       0.90      0.90      0.90       400
weighted avg       0.90      0.90      0.90       400
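
The metrics in the report follow directly from the confusion-matrix counts. The sketch below recomputes them for the `True` class. The four counts are hypothetical: they were chosen only because they are consistent with the rounded values printed above, and the actual counts are not shown in the notebook.

```python
# HYPOTHETICAL counts, consistent with the rounded report but not taken from it.
tp, fn = 175, 25   # articles with the tag: correctly found / missed
tn, fp = 186, 14   # articles without the tag: correctly rejected / wrongly flagged

precision = tp / (tp + fp)                           # how many predicted True are correct
recall = tp / (tp + fn)                              # how many actual True are found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + fn + tn + fp)           # share of all predictions that are correct

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```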



In [35]:
# show top 20 ngrams by logistic regression weight
ngram_indices_sorted = sorted(list(vectorizer.vocabulary_.items()), key=lambda t: t[1]) # sort (ngram, column index) pairs by column index
ngram_sorted = list(zip(*ngram_indices_sorted))[0] # unzip the pairs and keep only the ngrams, now in column order
ngram_weight_pairs = list(zip(ngram_sorted, model.coef_[0])) # pair each ngram with its weight
ngram_weight_pairs_sorted = sorted(ngram_weight_pairs, key=lambda t: t[1], reverse=True) # largest weight first
ngram_weight_pairs_sorted[:20]

[('science', 0.8175957716912231),
 ('de', 0.713399686754437),
 ('mon', 0.5809665978761287),
 ('latest', 0.5673511138802413),
 ('missed', 0.49128494465808265),
 ('data', 0.48252635649826536),
 ('picks', 0.47546828800651497),
 ('grafiti', 0.45901842100161927),
 ('average', 0.44949147568261005),
 ('python', 0.4433265086679952),
 ('datos', 0.433267705936979),
 ('nightingale', 0.3900554729334689),
 ('questions', 0.3853747366907102),
 ('know', 0.38271029017899966),
 ('article', 0.35812592469255394),
 ('math', 0.3531043095420392),
 ('simple', 0.3388751786230124),
 ('ai', 0.3372944354535811),
 ('los', 0.3316184973899332),
 ('so', 0.3293319742659352)]