## NLP Components
1. Lemmatization: Lemmatization reduces a word to its base or dictionary form (its lemma), so that different inflected forms are normalized to the same term.
- Example: The lemma of the word "running" is "run".

2. Token: A token is a single unit of text obtained after splitting a sentence or document based on certain criteria, such as whitespace or punctuation.
- Example: In the sentence "I love programming", the tokens are "I", "love", and "programming".

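3. Stemming: Stemming reduces words to a root form by stripping suffixes with heuristic rules, so that different variations of a word are treated as the same word. Unlike a lemma, a stem need not be a dictionary word.
- Example: The stem of the words "running" and "runs" is "run". Rule-based stemmers do not always merge every related form; the Porter stemmer, for instance, leaves "runner" unchanged.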

4. Stopwords: Stopwords are common words that are often filtered out during text processing because they typically do not carry significant meaning.
- Example: In English, stopwords include words such as "the", "is", and "and".

5. N-gram: An n-gram is a contiguous sequence of n items (words or characters) in a sentence or document.
- Example: In the sentence "The quick brown fox", examples of n-grams are unigrams ("The", "quick", "brown", "fox"), bigrams ("The quick", "quick brown", "brown fox"), and trigrams ("The quick brown", "quick brown fox").

6. TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF is a numerical statistic used to evaluate the importance of a word in a document relative to a collection of documents.
- Example: A word's TF-IDF score in a document is higher if it appears frequently in that document but rarely in the other documents in the collection.
* $ W_{x,y} = tf_{x,y} \times \log\left(\frac{N}{df_x}\right) $
* $W_{x,y}$ = TF-IDF weight of term x in document y
* $tf_{x,y}$ = frequency of x in y
* $df_x$ = number of documents containing x
* $N$ = total number of documents

7. Tokenizer: A tokenizer is a tool used to break down a text into smaller units, such as words, phrases, or sentences, based on specific rules.
- Example: The NLTK library in Python provides various tokenizers, such as word tokenizers and sentence tokenizers, which can be used to tokenize text data.
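
The token, stopword, and n-gram definitions above can be sketched in plain Python. This is a minimal illustration rather than NLTK's implementation: the regular-expression tokenizer, the tiny `STOPWORDS` set, and the helper names `tokenize`, `remove_stopwords`, and `ngrams` are all assumptions made for this example.

```python
import re

# Illustrative stopword set (an assumption); real lists such as NLTK's are much longer.
STOPWORDS = {"the", "is", "and", "a", "of"}

def tokenize(text):
    """Split text into lowercase word tokens using a simple regular expression."""
    return re.findall(r"[a-z0-9']+", text.lower())

def remove_stopwords(tokens):
    """Drop every token that appears in the stopword set."""
    return [t for t in tokens if t not in STOPWORDS]

def ngrams(tokens, n):
    """Return all contiguous sequences of n tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = tokenize("The quick brown fox")
print(tokens)                    # ['the', 'quick', 'brown', 'fox']
print(remove_stopwords(tokens))  # ['quick', 'brown', 'fox']
print(ngrams(tokens, 2))         # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(ngrams(tokens, 3))         # [('the', 'quick', 'brown'), ('quick', 'brown', 'fox')]
```

A sequence of four tokens yields three bigrams and two trigrams; asking for an n larger than the number of tokens returns an empty list.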




In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
import re
from nltk.stem.porter import PorterStemmer
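
The difference between stemming and lemmatization described in the glossary can be illustrated with a toy sketch. Neither function below is NLTK's `PorterStemmer` (imported above) or a real lemmatizer: `naive_stem`, `naive_lemma`, the three suffix rules, and the `LEMMAS` table are hypothetical and exist only to show that stemming applies rules while lemmatization looks words up.

```python
def naive_stem(word):
    """Strip a few common suffixes; a toy rule set, far simpler than Porter's."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Undo consonant doubling left behind by the suffix ("runn" -> "run")
    if len(word) > 3 and word[-1] == word[-2] and word[-1] not in "aeiouls":
        word = word[:-1]
    return word

# Hypothetical lemma table; real lemmatizers consult a full dictionary.
LEMMAS = {"ran": "run", "mice": "mouse", "better": "good"}

def naive_lemma(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    word = word.lower()
    return LEMMAS.get(word, word)

for w in ["running", "runs", "ran", "mice"]:
    print(w, "-> stem:", naive_stem(w), "| lemma:", naive_lemma(w))
```

The rule-based stemmer handles regular forms such as "running" but cannot relate "ran" to "run"; the lookup handles irregular forms but only those present in its table.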


In [2]:
df = pd.read_csv('CNN_Articels_clean.csv')

In [3]:
df

Unnamed: 0,Date published,Category,Section,Article
0,7/15/21 2:46,news,world,"There's a shortage of truckers, but TuSimple t..."
1,5/12/21 7:52,news,world,Bioservo's robotic 'Ironhand' could protect fa...
2,6/16/21 2:51,news,asia,This swarm of robots gets smarter the more it ...
3,3/15/22 9:57,business,investing,Russia is no longer an option for investors. T...
4,3/15/22 11:27,business,business,Russian energy investment ban part of new EU s...
...,...,...,...,...
4071,12/1/21 10:01,sport,tennis,Australian Open: Australia's vaccine mandate i...
4072,12/1/21 17:56,sport,golf,Four golfers test positive ahead of South Afri...
4073,12/1/21 11:32,sport,tennis,Peng Shuai: 'Unanimous conclusion' that tennis...
4074,12/1/21 17:27,news,europe,"This company is ""zapping"" cow dung with lightn..."


In [9]:
import nltk

# Called without arguments, nltk.download() opens the interactive NLTK downloader
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml

True

In [10]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')

[nltk_data] Error loading punkt: <urlopen error [Errno 54] Connection
[nltk_data]     reset by peer>
[nltk_data] Error loading stopwords: <urlopen error [Errno 54]
[nltk_data]     Connection reset by peer>


False

In [5]:
list_art = df['Article'].to_list()
list_cat = df['Category'].to_list()
list_sec = df['Section'].to_list()


In [6]:
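from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once; calling stopwords.words() for every word is slow.
# Assumes the NLTK 'stopwords' and 'punkt' resources are installed (the download
# attempt above failed and would need to be retried first; newer NLTK releases
# may ask for 'punkt_tab' instead of 'punkt').
stop_words = set(stopwords.words('english'))

# Define a function to preprocess the text
def preprocess_text(text):
    # Replace every non-letter character (punctuation, digits) with a space
    text = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase the text and tokenize it into words
    words = word_tokenize(text.lower())
    # Remove stop words
    words = [word for word in words if word not in stop_words]
    # Join the words back into a string
    return ' '.join(words)

# Preprocess the corpus
corpus = [preprocess_text(doc) for doc in list_art]
# Print only a preview; the full corpus holds thousands of articles
print('Corpus (first document, truncated): \n{}'.format(corpus[0][:500]))

# Create a TfidfVectorizer object and fit it to the preprocessed corpus
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# Transform the preprocessed corpus into a sparse TF-IDF matrix
tf_idf_matrix = vectorizer.transform(corpus)

# Get list of feature names that correspond to the columns in the TF-IDF matrix
print("Feature Names:\n", vectorizer.get_feature_names_out())

# Print the matrix shape; calling .toarray() on a corpus this size would build
# a very large dense array
print("TF-IDF Matrix shape:", tf_idf_matrix.shape)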
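
To connect this cell to the formula $W_{x,y} = tf_{x,y} \times \log(N / df_x)$ from the TF-IDF definition, the weights can be computed by hand on a toy corpus. The three documents and the `tf_idf` helper below are made up for illustration, and the natural logarithm is an assumption, since the formula does not fix a base. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its values will differ from this textbook version.

```python
import math
from collections import Counter

# Toy corpus (made up for illustration); each document is already tokenized.
docs = [
    ["cat", "sat", "mat"],
    ["cat", "cat", "dog"],
    ["dog", "barked"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # df: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

for i, w in enumerate(tf_idf(docs)):
    print(i, {t: round(v, 3) for t, v in w.items()})
```

In this toy corpus "cat" occurs twice in the second document but appears in two of the three documents, so its weight there (about 0.811) is lower than that of "sat" in the first document (about 1.099), which occurs once but in only one document.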