# <font color="orange">Text Mining</font>

<h3><strong class="ba">Introduction to TF-IDF</strong></h3>    
<p>Computers do not handle text as readily as they handle numbers, so text must first be converted into a numeric representation. One of the most widely used techniques for this is Term Frequency-Inverse Document Frequency (TF-IDF). This article discusses its components and how it works.</p>

<p>Intuitively, the words that appear most often seem the most significant, but that is not the case in text analysis. Words such as “the”, “is”, “a”, and “an” are called stop words: they appear most often in a corpus but carry little weight when distinguishing one document from another. Rarer words are the ones that help tell documents apart, so they carry more weight.</p>

<h4><strong class="ba">Term Frequency (tf):</strong></h4>
<p>Term frequency measures how often a word occurs in a given document: it is the number of times the word appears in the document divided by the total number of words in that document. It increases as the word occurs more often within the document, and each document has its own tf values. The equation for tf is given below.</p>

$$\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}$$

<p>Here $n_{t,d}$ is the number of times term $t$ appears in document $d$, and the denominator is the total number of words in $d$.</p>

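<h4><strong class="ba">Inverse Document Frequency (idf):</strong></h4>
<p>Inverse document frequency measures how rare a word is across all documents in the corpus. Words that occur in few documents have a high idf score, while words that occur in almost every document have a score close to zero. The equation for idf is given below.</p>

$$\mathrm{idf}(t) = \log\frac{N}{\mathrm{df}_t}$$

<p>Here $N$ is the number of documents in the corpus and $\mathrm{df}_t$ is the number of documents that contain term $t$. The TF-IDF score of a term in a document is the product $\mathrm{tf}(t, d) \cdot \mathrm{idf}(t)$.</p>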
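
<p>As a rough illustration of the two definitions above, the sketch below computes tf, idf, and their product by hand on a small made-up corpus. The corpus, the function names, and the use of the natural logarithm are assumptions made for this example. Library implementations such as scikit-learn's <code>TfidfVectorizer</code> apply additional smoothing and normalization by default, so their numbers will differ.</p>

```python
import math
from collections import Counter

# Toy corpus (hypothetical); each document is already tokenized.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
]

def term_frequency(term, doc):
    """Occurrences of term in doc divided by the number of words in doc."""
    return Counter(doc)[term] / len(doc)

def inverse_document_frequency(term, corpus):
    """log(N / df): N documents in total, df of them contain the term.

    Assumes the term occurs in at least one document (df > 0).
    """
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tf_idf(term, doc, corpus):
    return term_frequency(term, doc) * inverse_document_frequency(term, corpus)

# Score three terms of the first document.
for term in ("the", "sat", "cat"):
    print(term, round(tf_idf(term, docs[0], docs), 4))
```

<p>Here “the” appears in every document, so its idf is log(3/3) = 0 and its score vanishes, while “cat” appears in only one document and gets the highest score of the three.</p>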

<a href="https://medium.com/@imayan_blog/text-data-mining-using-term-frequency-inverse-data-frequency-tf-idf-aedd6f1d0b38">read more...</a>
###  <font color="brown">Text preprocessing:</font>
<ul>
    <li>remove stop words</li>
    <li>stemming  <a href="https://towardsdatascience.com/stemming-of-words-in-natural-language-processing-what-is-it-41a33e8996e2#:~:text=To%20put%20simply%2C%20stemming%20is,to%20chop%20a%20word%20off.">read more...</a></li>
    <li>lemmatization <a href="https://github.com/sobhe/hazm">read more...</a></li>
</ul>
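
<p>To make the first two steps concrete, here is a minimal sketch in plain Python. The stop-word set and the suffix rules are hypothetical and far cruder than what a real stemmer does; the notebook itself relies on <code>hazm</code> for Persian text. The point is only to show the shape of the pipeline: filter tokens, then reduce each remaining token to a stem.</p>

```python
# Hypothetical English stop-word list and suffix rules, for illustration only.
STOP_WORDS = {"the", "is", "a", "an", "on", "of"}
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split on whitespace, drop stop words, stem what is left."""
    tokens = text.lower().split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(preprocess("The cats jumped on the moving boxes"))
```

<p>The result contains stems such as “mov” and “boxe”, which are not dictionary words. That is the main difference from lemmatization, which maps each word to its dictionary form instead of chopping off a suffix.</p>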

In [1]:
import pandas as pd
import numpy as np
import nltk
import hazm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix,classification_report

In [3]:
DT = pd.read_csv('../../../../datasets/per.csv')

In [4]:
DT.head()

Unnamed: 0,NewsID,Title,Body,Date,Time,Category,Category2
0,843656,\nوزير علوم درجمع استادان نمونه: سن بازنشستگي ...,\nوزير علوم در جمع استادان نمونه كشور گفت: از ...,\n138/5//09,\n0:9::18,\nآموزشي-,\nآموزشي
1,837144,\nگردهمايي دانش‌آموختگان موسسه آموزش عالي سوره...,\nبه گزارش سرويس صنفي آموزشي خبرگزاري دانشجويا...,\n138/5//09,\n1:4::11,\nآموزشي-,\nآموزشي
2,436862,\nنتايج آزمون دوره‌هاي فراگير دانشگاه پيام‌نور...,\nنتايج آزمون دوره‌هاي فراگير مقاطع كارشناسي و...,\n138/3//07,\n1:0::03,\nآموزشي-,\nآموزشي
3,227781,\nهمايش يكروزه آسيب شناسي مفهوم روابط عمومي در...,\n,\n138/2//02,\n1:3::42,\nاجتماعي-خانواده-,\nاجتماعي
4,174187,\nوضعيت اقتصادي و ميزان تحصيلات والدين از مهمت...,\nمحمدتقي علوي يزدي، مجري اين طرح پژوهشي در اي...,\n138/1//08,\n1:1::49,\nآموزشي-,\nآموزشي


In [6]:
with open('../../../../datasets/stopwords.txt') as stopwords_file:
    stopwords = stopwords_file.readlines()
stopwords = [w.strip() for w in stopwords]
len(stopwords)

1316

In [7]:
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [8]:
nltk_en_stopwords = nltk.corpus.stopwords.words('english')
nltk_ar_stopwords = nltk.corpus.stopwords.words('arabic')

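# Merge all lists into a set: this removes duplicates and makes the
# membership test in the cleaning step a hash lookup instead of a list scan
stopwords = set(stopwords) | set(nltk_en_stopwords) | set(nltk_ar_stopwords)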

In [9]:
stemmer = hazm.Stemmer()
lemmatizer = hazm.Lemmatizer()

In [10]:
def clean_text(text):
    # Tokenize, drop stop words, then stem and lemmatize each remaining token
    tokens = [lemmatizer.lemmatize(stemmer.stem(token)).replace('#', ' ').strip()
              for token in hazm.word_tokenize(text) if token not in stopwords]
    return ' '.join(tokens)

DT['title_body'] = (DT['Title'] + ' ' + DT['Body']).apply(clean_text)

In [11]:
DT['category'] = DT['Category2'].str.strip()
# Take an explicit copy so that later assignments do not write to a slice of DT
# (this avoids pandas' SettingWithCopyWarning)
DF = DT[['title_body','category']].copy()

In [39]:
# DF.to_csv('./cleaned_news.csv',encoding='utf-8')
# DF = pd.read_csv('./cleaned_news.csv')

In [12]:
# Replace hyphens and zero-width non-joiners (U+200C) in the labels with spaces
DF['category'] = DF['category'].apply(lambda w : w.replace('-',' ').replace('\u200c',' ').strip())


In [13]:
vectorizer = TfidfVectorizer(ngram_range=(1,3))

In [14]:
# Learn the vocabulary and build the TF-IDF matrix in one step
X = vectorizer.fit_transform(DF['title_body'])
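
The `ngram_range=(1,3)` setting makes the vectorizer count single words, word pairs, and word triples as separate features, which is why the vocabulary grows so quickly. A small sketch on a made-up two-document corpus (the corpus and variable names are assumptions for this example) shows the effect:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["new york times", "new york post"]  # hypothetical toy corpus

# Same data, two settings: single words only vs. words, pairs and triples.
unigrams = TfidfVectorizer(ngram_range=(1, 1)).fit(corpus)
up_to_trigrams = TfidfVectorizer(ngram_range=(1, 3)).fit(corpus)

print(sorted(unigrams.vocabulary_))
print(sorted(up_to_trigrams.vocabulary_))
print(up_to_trigrams.transform(corpus).shape)
```

With only four distinct words the vocabulary more than doubles once pairs and triples are included. On a real news corpus most of those longer n-grams occur in a single document, so the feature count can run into the millions.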

In [19]:
np.random.seed(3020) # set random state in numpy

In [15]:
X

<10999x3864455 sparse matrix of type '<class 'numpy.float64'>'
	with 7952922 stored elements in Compressed Sparse Row format>

In [18]:
X[:100,:100]

<100x100 sparse matrix of type '<class 'numpy.float64'>'
	with 0 stored elements in Compressed Sparse Row format>
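
The empty 100x100 slice is less surprising than it looks. A quick calculation with the shape and stored-element count reported for `X` above (the variable names below are chosen for this sketch) shows how sparse the matrix is:

```python
n_docs, n_features = 10999, 3864455   # shape reported for X
stored = 7952922                      # stored (non-zero) elements reported for X

# Fraction of cells that hold a value, and the average per row.
density = stored / (n_docs * n_features)
avg_per_doc = stored / n_docs

print(f"density: {density:.6%}")
print(f"average non-zero features per document: {avg_per_doc:.1f}")
```

Only about 0.02% of the cells are non-zero, so a block of 10,000 cells would hold roughly two values if they were spread evenly, and a block with none at all is quite plausible. This is also why the matrix is kept in sparse format: a dense copy of the full matrix would need tens of billions of cells.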

In [None]:
X[:100,:100].todense()

In [76]:
label_encoder = LabelEncoder()
Y = label_encoder.fit_transform(DF['category'])
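
`LabelEncoder` replaces each distinct category string with an integer code. A minimal sketch with made-up labels (not the dataset's categories) shows the mapping and how to get the original strings back:

```python
from sklearn.preprocessing import LabelEncoder

categories = ["sport", "economy", "sport", "politics"]  # hypothetical labels

encoder = LabelEncoder()
codes = encoder.fit_transform(categories)

print(list(encoder.classes_))                  # classes are stored in sorted order
print(list(codes))                             # integer code of each input label
print(list(encoder.inverse_transform(codes)))  # back to the original strings
```

The codes are only identifiers: their numeric order follows the sorted class names and carries no meaning. Keeping the fitted encoder around makes it possible to translate predictions back into category names with `inverse_transform`.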

In [77]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=42)

In [78]:
svmc = SVC()

In [None]:
svmc.fit(X_train,y_train)

In [None]:
accuracy = svmc.score(X_test,y_test)
accuracy

In [None]:
y_predict = svmc.predict(X_test)

In [None]:
# classification_report is imported directly from sklearn.metrics in the first cell
report = classification_report(y_test,y_predict)
print(report)

In [None]:
print(confusion_matrix(y_test,y_predict))
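
As a reading aid for the two outputs above, here is a small sketch on made-up labels (both lists are hypothetical, not results from this dataset). In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes, so correct predictions sit on the diagonal.

```python
from sklearn.metrics import confusion_matrix, classification_report

y_true = [0, 0, 1, 1, 1, 2]   # hypothetical true labels
y_pred = [0, 1, 1, 1, 0, 2]   # hypothetical predictions

cm = confusion_matrix(y_true, y_pred)
print(cm)  # rows: true class, columns: predicted class

# Accuracy is the diagonal (correct predictions) divided by all predictions.
accuracy = cm.trace() / cm.sum()
print(accuracy)

print(classification_report(y_true, y_pred))
```

Off-diagonal cells show which classes get confused with each other, which a single accuracy number hides. The report adds per-class precision, recall, and F1, which are worth checking when the categories are unevenly represented.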