# ***DATASET INTRODUCTION***

News has an immense impact on society and civic life: it is a vital source of information for the public and, at the same time, a watchdog against the misdeeds of the powerful. Its influence on people's decisions can be both positive and negative. Accurate reporting helps people judge situations rationally, whereas false news can lead to serious social unrest or political upheaval. News also shapes how people perceive current affairs. Good news may leave a person more confident, hopeful, and self-assured, while bad news can have the opposite psychological effect. A major cause of such effects is the sharp increase in reports on crimes, accidents, and suicides, which can make it harder for readers to remain emotionally balanced.

---



***FIRST WE LOAD OUR DATASET***

This program runs in a Google Colab environment, where it loads and inspects a dataset of fake and real news. It begins by importing the Pandas library, which is used for data manipulation and analysis. It then mounts Google Drive so that the files saved there are available in the Colab environment. Finally, once the drive is mounted, the dataset at the given path is loaded into a Pandas DataFrame.

In [None]:
import pandas as pd
#import dataset
from google.colab import drive
drive.mount('/content/drive')
dataset = pd.read_csv('/content/drive/MyDrive/Fake_Real_News_Data.csv')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
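
The previews later in this notebook show an `Unnamed: 0` column, which usually means the CSV was saved together with its row index. The following is a minimal sketch, assuming that is what happened here, of how the `index_col=0` argument of `pd.read_csv` keeps such a column out of the data. The in-memory CSV is an invented stand-in, not the real file.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for Fake_Real_News_Data.csv (not the real data).
csv_text = """,title,text,label
0,Headline A,Body of article A,REAL
1,Headline B,Body of article B,FAKE
"""

# Default read: the unnamed first column becomes a regular 'Unnamed: 0' column.
raw = pd.read_csv(io.StringIO(csv_text))

# index_col=0 treats that first column as the row index instead.
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(clean.columns))
```

Whether this applies to the real file depends on how it was exported, so it is worth comparing both reads before changing the loading cell.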


# ***CHECKING THE DATASET***

Before any analysis, the dataset needs to be checked to establish its structure and quality. This preparatory phase confirms that the data is complete and ready for analysis. Checking a few essentials, including the number of rows and columns, the presence of missing values, and the amount of text, helps surface problems that could undermine the accuracy and reliability of later results. It also gives a clear picture of the dataset, so that any preprocessing or cleaning can be done before further investigation.

---




The following code uses the head() method to display the first five rows of the dataset variable. Inspecting the top of a dataset is a common way to confirm its structure, column names, and contents. Viewing the first five observations confirms that the data was imported properly and gives an early impression of the format and content of the dataset.


In [None]:
#head() returns the first 5 rows so that we can check the dataset
dataset.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,A whirlwind day in D.C. showcases Trump’s unor...,Donald Trump endorsed an unabashedly noninterv...,REAL
1,1,"In Baltimore's call for federal police probe, ...",While some Justice Department investigations a...,REAL
2,2,Trump Proudly Declares: Most Of The People I’v...,Trump Proudly Declares: Most Of The People I’v...,FAKE
3,3,Inside the Trump-Bush melodrama: Decades of te...,Donald Trump spent a day in January 2014 hobno...,REAL
4,4,Shutdown clash to return in force by December,Notable names include Ray Washburne (Commerce)...,REAL


***CHECKING THE SIZE OF THE DATASET AND NULL VALUES***

The following code inspects the basic structure and contents of the dataset: the number of columns, the number of rows, and the overall number of words. The shape attribute supplies the column count (shape[1]) and the row count (shape[0]), giving an overview of the dataset's size. The code then iterates over all string (object) columns, counts the words in each entry by splitting on whitespace, and sums these counts to estimate how much text data is available, which is useful to know when working with text-based datasets.

Next, the isnull().sum() method counts the missing entries in each column, and these per-column counts are added up to give the total number of null values in the dataset. Together, these checks describe the completeness of the dataset and the volume of its text, an important step before cleaning and preparing the data.


In [None]:
# Number of columns
num_columns = dataset.shape[1]
print(f"Number of columns: {num_columns}")

# Number of rows
num_rows = dataset.shape[0]
print(f"Number of rows: {num_rows}")

# Count the total number of words in the dataset (assuming text data is present)
word_count = 0
for col in dataset.select_dtypes(include=[object]):  # Only consider object (string) columns
    word_count += dataset[col].apply(lambda x: len(str(x).split())).sum()

print(f"Total number of words in the dataset: {word_count}")

# Count the number of null values in each column
null_values_per_column = dataset.isnull().sum()
print("\nNumber of null values in each column:")
print(null_values_per_column)

# Count the total number of null values in the dataset
total_null_values = null_values_per_column.sum()
print(f"\nTotal number of null values in the dataset: {total_null_values}")


Number of columns: 4
Number of rows: 6335
Total number of words in the dataset: 4990695

Number of null values in each column:
Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

Total number of null values in the dataset: 0
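
The same size and completeness checks can also be written with pandas string methods instead of a Python-level lambda. This is a sketch on an invented toy DataFrame, with a hand-written list of text columns as an assumption. One difference from the loop above is that `str(x).split()` counts a missing value as the single word "nan", while `.str.split().str.len()` yields NaN for it, which `sum()` skips.

```python
import pandas as pd

# Toy frame standing in for the news dataset (values are invented).
toy = pd.DataFrame({
    "title": ["Breaking news today", "Another headline"],
    "text": ["Some short article body", None],  # one missing value on purpose
    "label": ["REAL", "FAKE"],
})

# Assumed list of text columns; adjust to the columns of the real dataset.
text_columns = ["title", "text", "label"]

# Words per column: split on whitespace, count the pieces, sum the counts.
words_per_column = {
    col: int(toy[col].str.split().str.len().sum())
    for col in text_columns
}
total_words = sum(words_per_column.values())

# Null counts per column and overall.
nulls_per_column = toy.isnull().sum()
total_nulls = int(nulls_per_column.sum())

print(words_per_column, total_words, total_nulls)
```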


**DATASET NOISE**

The noise found in this dataset includes unwanted punctuation, misplaced quotation marks, dashes, and other extraneous items such as e-mail addresses (for example "undergroundreporter2016@gmail.com") and URLs (for example "https://t.co/VyTT49YvoE pic.twitter.com/wCvSCg4a5I").

Some of this noise, such as the special characters ®, ©, and ™, does not convey any information. Formatting problems also occur, including inconsistent capitalization and excess white space in the form of tabs, extra spaces, or repeated newlines. These elements make the data harder to comprehend and analyze, so cleaning and standardization are important for improving its quality.
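
The pipeline below does not treat URLs and e-mail addresses separately, so the following is only a sketch of how such a step might look if it were added. The function name `strip_urls_and_emails` and both patterns are assumptions made for illustration; the patterns cover common cases rather than every valid URL or address.

```python
import re

# Illustrative patterns only: they cover common cases, not every valid URL or address.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+|pic\.twitter\.com/\S+")
EMAIL_PATTERN = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")

def strip_urls_and_emails(text):
    """Remove URL-like and e-mail-like substrings, then collapse extra whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = EMAIL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

sample = ("Contact undergroundreporter2016@gmail.com   or see "
          "https://t.co/VyTT49YvoE pic.twitter.com/wCvSCg4a5I for more\n\n details")
print(strip_urls_and_emails(sample))
```

A step like this would need to run before special characters are removed, because that later step would otherwise leave the letters and digits of a URL behind as one meaningless token.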


# Pipeline
# ----------------------------------------------
## 1. Lowercasing
## 2. Special Character Removal
## 3. Stopword Removal
## 4. Tokenization

# Lowercasing

Lowercasing is a common text preprocessing method that converts all characters in a document to lowercase. This provides consistency by treating variants such as "The" and "the" as the same token. It simplifies the data by reducing the number of unique tokens, which can improve the efficiency of text analysis algorithms and machine learning models by letting them focus on content rather than case.

---




This code snippet performs a single-case test to check that an entry in the dataset has the expected content and format. It selects the first value from the DataFrame's 'text' column, assigns it to the variable test_case, and prints it to the console. Focusing on a single record makes it easy to spot problems and to confirm that the data has been loaded and structured correctly.


In [None]:
#Single Case Testing for checking the case of the text
test_case = dataset['text'][0]
print(test_case)

Donald Trump endorsed an unabashedly noninterventionist approach to world affairs Monday during a day-long tour of Washington, casting doubt on the need for the North Atlantic Treaty Organization and expressing skepticism about a muscular U.S. military presence in Asia.

The foreign policy positions — outlined in a meeting with the editorial board of The Washington Post — came on a day when Trump set aside the guerrilla tactics and showman bravado that have powered his campaign to appear as a would-be presidential nominee, explaining his policies, accepting counsel and building bridges to Republican elites.

On Monday night, Trump delivered a scripted address in front of the American Israel Public Affairs Committee, prompting ovations with pledges to stand by Israel and take a hard line on peace talks with the Palestinians.

Trump’s whirlwind day of appearances around the nation’s capital was intended in part to head off an establishment push to deny him the Republican Party’s nominati

This code snippet standardizes the text for consistency and comparability. The lower() method converts the contents of test_case to lowercase, and the result is printed. With all text in lowercase, subsequent operations and tests are not affected by differences in case, which is particularly useful for text processing work that requires uniformity.

In [None]:
#Lowercasing the test_case variable for checking
test_case = test_case.lower()
print(test_case)

donald trump endorsed an unabashedly noninterventionist approach to world affairs monday during a day-long tour of washington, casting doubt on the need for the north atlantic treaty organization and expressing skepticism about a muscular u.s. military presence in asia.

the foreign policy positions — outlined in a meeting with the editorial board of the washington post — came on a day when trump set aside the guerrilla tactics and showman bravado that have powered his campaign to appear as a would-be presidential nominee, explaining his policies, accepting counsel and building bridges to republican elites.

on monday night, trump delivered a scripted address in front of the american israel public affairs committee, prompting ovations with pledges to stand by israel and take a hard line on peace talks with the palestinians.

trump’s whirlwind day of appearances around the nation’s capital was intended in part to head off an establishment push to deny him the republican party’s nominati

This section of the code applies the same normalization to the whole 'text' column of the dataset DataFrame. The apply method, together with a lambda function, converts every entry of the 'text' column to lowercase, ensuring that all text data is uniformly formatted. The head() method is then used to show the first ten rows of the modified DataFrame. This keeps the text data consistent, which is important for correct text analysis and comparison, and makes it easy to see at a glance how the data has changed.


In [None]:
#applying it to the dataset
dataset['text'] = dataset['text'].apply(lambda x: str(x).lower())
#checking for first 10 instances
dataset.head(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,A whirlwind day in D.C. showcases Trump’s unor...,donald trump endorsed an unabashedly noninterv...,REAL
1,1,"In Baltimore's call for federal police probe, ...",while some justice department investigations a...,REAL
2,2,Trump Proudly Declares: Most Of The People I’v...,trump proudly declares: most of the people i’v...,FAKE
3,3,Inside the Trump-Bush melodrama: Decades of te...,donald trump spent a day in january 2014 hobno...,REAL
4,4,Shutdown clash to return in force by December,notable names include ray washburne (commerce)...,REAL
5,5,Can the great nuclear war be prevented ?,"«current concerns», n°23, october 22th, 2016\n...",FAKE
6,6,US charges 61 with India-based scam involving ...,us charges 61 with india-based scam involving ...,FAKE
7,7,Democratic debate 2015: Hillary Clinton and Be...,watch the first democratic presidential debate...,REAL
8,8,What DNC Donors REALLY think of African Americans,this video is really disturbing... \nnot just ...,FAKE
9,9,"What happens to delegates won by Rubio, other ...",with marco rubio dropping out of the republica...,REAL
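
To make the effect of lowercasing on the number of unique tokens concrete, here is a small sketch on invented sentences. It uses the vectorized `.str.lower()` accessor, which does the same job as `apply(lambda x: str(x).lower())` for string entries. The helper `vocabulary` is a hypothetical name introduced only for this example.

```python
import pandas as pd

# Invented sentences; the real column is dataset['text'].
texts = pd.Series(["The cat saw the Dog", "THE dog saw The cat"])

def vocabulary(series):
    """Set of distinct whitespace-separated tokens across a Series of strings."""
    return set(word for entry in series for word in entry.split())

before = vocabulary(texts)
after = vocabulary(texts.str.lower())  # vectorized lowercasing

print(len(before), sorted(before))
print(len(after), sorted(after))
```

One difference to keep in mind: `.str.lower()` leaves missing values as missing, whereas `str(x).lower()` turns them into the string "nan". This dataset has no nulls, so the two should behave the same here.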



# Special Character Removal

Special character removal is a text preparation technique that removes characters that are neither letters nor digits, such as &, @, #, %, and $. It makes the data cleaner and more consistent by concentrating on the textual content of the documents, and it can improve the quality of text analytics and machine learning models by removing irrelevant symbols that might interfere with the outcome.

---



The following single-case test is used to locate special characters in one entry of the data. It extracts the first entry from the 'text' column of the DataFrame, stores it in the variable test_case, and prints it to the console. This makes it possible to check the text for special characters or anomalies before further processing or analysis.


In [None]:
#Single Case Testing for locating special characters
test_case = dataset['text'][0]
print(test_case)

donald trump endorsed an unabashedly noninterventionist approach to world affairs monday during a day-long tour of washington, casting doubt on the need for the north atlantic treaty organization and expressing skepticism about a muscular u.s. military presence in asia.

the foreign policy positions — outlined in a meeting with the editorial board of the washington post — came on a day when trump set aside the guerrilla tactics and showman bravado that have powered his campaign to appear as a would-be presidential nominee, explaining his policies, accepting counsel and building bridges to republican elites.

on monday night, trump delivered a scripted address in front of the american israel public affairs committee, prompting ovations with pledges to stand by israel and take a hard line on peace talks with the palestinians.

trump’s whirlwind day of appearances around the nation’s capital was intended in part to head off an establishment push to deny him the republican party’s nominati

This code cleans up the text by removing unwanted characters using the re library. It calls re.sub on the test_case variable, which holds text from the 'text' column of the dataset DataFrame, with the regular expression pattern [^a-zA-Z0-9\s], which matches any character that is not an ASCII letter, a digit, or a whitespace character. Every character that matches this pattern is replaced by an empty string and is therefore deleted from the text. Finally, the cleaned text is printed to confirm that no special or non-alphanumeric characters remain in the data.


In [None]:
#import library regex
import re
#remove every character that is not a letter, a digit, or whitespace
#(raw string, so that \s reaches the regex engine unchanged)
test_case = re.sub(r"[^a-zA-Z0-9\s]",'',test_case)
print(test_case)

donald trump endorsed an unabashedly noninterventionist approach to world affairs monday during a daylong tour of washington casting doubt on the need for the north atlantic treaty organization and expressing skepticism about a muscular us military presence in asia

the foreign policy positions  outlined in a meeting with the editorial board of the washington post  came on a day when trump set aside the guerrilla tactics and showman bravado that have powered his campaign to appear as a wouldbe presidential nominee explaining his policies accepting counsel and building bridges to republican elites

on monday night trump delivered a scripted address in front of the american israel public affairs committee prompting ovations with pledges to stand by israel and take a hard line on peace talks with the palestinians

trumps whirlwind day of appearances around the nations capital was intended in part to head off an establishment push to deny him the republican partys nomination but in the pos
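
The output above shows some side effects of deleting characters outright: "u.s." becomes "us" and "day-long" becomes "daylong". The sketch below reproduces this on an invented sentence and compares it with one possible alternative, replacing each removed character with a space. The alternative is an illustration of the trade-off, not something the notebook does.

```python
import re

# Same character class as in the notebook, written as a raw string and compiled once.
NON_ALNUM = re.compile(r"[^a-zA-Z0-9\s]")

# Invented sample; \u2019 is a curly apostrophe and \u00e9 is an accented e.
sample = "the u.s. president\u2019s day-long caf\u00e9 visit cost $5,000"

# Notebook behaviour: matched characters are deleted, so neighbours merge.
deleted = NON_ALNUM.sub("", sample)

# Alternative: replace with a space, then collapse runs of whitespace.
spaced = re.sub(r"\s+", " ", NON_ALNUM.sub(" ", sample)).strip()

print(deleted)
print(spaced)
```

Deleting merges word parts ("daylong") and also drops accented letters ("caf"), while spacing splits them into fragments ("day long", "u s"). Which behaviour is preferable depends on the later analysis.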

The following code removes unwanted characters from the entire 'text' column of the DataFrame. Here apply() is used together with a lambda function that calls re.sub() on each entry, removing every character that is not a letter (a-z, A-Z), a digit (0-9), or whitespace. This approach is useful when the text should be limited to alphanumeric characters and whitespace. After cleaning all entries in the 'text' column, the head() method shows the first ten rows of the updated DataFrame, allowing a quick check that the undesirable characters have been removed.


In [None]:
#apply to all
dataset['text'] = dataset['text'].apply(lambda x: re.sub(r"[^a-zA-Z0-9\s]",'',x))
dataset.head(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,A whirlwind day in D.C. showcases Trump’s unor...,donald trump endorsed an unabashedly noninterv...,REAL
1,1,"In Baltimore's call for federal police probe, ...",while some justice department investigations a...,REAL
2,2,Trump Proudly Declares: Most Of The People I’v...,trump proudly declares most of the people ive ...,FAKE
3,3,Inside the Trump-Bush melodrama: Decades of te...,donald trump spent a day in january 2014 hobno...,REAL
4,4,Shutdown clash to return in force by December,notable names include ray washburne commerce a...,REAL
5,5,Can the great nuclear war be prevented ?,current concerns n23 october 22th 2016\ncan th...,FAKE
6,6,US charges 61 with India-based scam involving ...,us charges 61 with indiabased scam involving 1...,FAKE
7,7,Democratic debate 2015: Hillary Clinton and Be...,watch the first democratic presidential debate...,REAL
8,8,What DNC Donors REALLY think of African Americans,this video is really disturbing \nnot just to ...,FAKE
9,9,"What happens to delegates won by Rubio, other ...",with marco rubio dropping out of the republica...,REAL


**For further inspection**

For further inspection of the preprocessing, this cell examines a single entry in the 'text' column of the dataset DataFrame. It prints the 101st element (index 100). This check confirms that the text cleaning has been applied correctly to this particular entry, that is, that the text is well formatted and free of unwanted characters.


In [None]:
#For inspection
print(dataset['text'][100])

posted on november 4 2016 by charles hugh smith 
the overwhelming consensus of the punditry across the political spectrum is that nothing good can come of this electionand thats a very good thing the handwringing goes like this the country is deeply divided by schisms that cannot be bridged every institution from the two parties to the mainstream media to the department of justice has been tarnished by coverups collusion or worse whomever wins the election will enter the presidency without a mandate and so on 
why is nothing good can come of this good because ridding the nation of its political corruption will require hitting bottom 
just as an alcoholic or drug addict is incapable of making any truly positive changes until heshe hits absolute bottom so it is with our tolerance of a corrupt political system that is poisoning the nation one injection of corrupt cash collusion and paytoplay at a time 
if our rottentothecore politics as usual is indeed flying off the cliff to complete des

# Stop word Removal

### Example English stopwords:
#### the, is, and

A stop word is a common word such as "the," "is," or "and" that is removed from a document during preprocessing. Stop word removal aims to improve the relevance and efficiency of text analysis by focusing on the more significant terms that bear directly on the overall meaning. By cutting the noise that stop words introduce, it can improve performance in text mining and natural language processing tasks.


---



In [None]:
from nltk.corpus import stopwords

This line imports the stopwords corpus from the Natural Language Toolkit (NLTK) package. The stopwords corpus is a collection of very common words, such as "the", "is", and "in", that can usually be ignored during text processing because they add little meaning to the analysis. With this corpus, stop words can be removed from text data so that analysis focuses on the more important terms.


In [None]:
import nltk
nltk.download('stopwords')
#download and import english stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


This code sample uses the Natural Language Toolkit to download and load English stopwords. Once the nltk library is imported, nltk.download('stopwords') fetches the stopwords dataset. The code then calls stopwords.words('english') to load the list of English stop words and assigns it to the variable stop_words. Finally, it prints the list. This provides a predefined list of frequent English terms that can be removed from the text data so that the analysis concentrates on the more significant keywords.


The next cell is another single-case check, performed to confirm that the value of the test_case variable is as expected after the text cleaning steps. Printing test_case shows whether the text is well formatted and ready for the next step, the removal of stopwords.


In [None]:
#Single Case Testing for checking stopwords
print(test_case)

donald trump endorsed an unabashedly noninterventionist approach to world affairs monday during a daylong tour of washington casting doubt on the need for the north atlantic treaty organization and expressing skepticism about a muscular us military presence in asia

the foreign policy positions  outlined in a meeting with the editorial board of the washington post  came on a day when trump set aside the guerrilla tactics and showman bravado that have powered his campaign to appear as a wouldbe presidential nominee explaining his policies accepting counsel and building bridges to republican elites

on monday night trump delivered a scripted address in front of the american israel public affairs committee prompting ovations with pledges to stand by israel and take a hard line on peace talks with the palestinians

trumps whirlwind day of appearances around the nations capital was intended in part to head off an establishment push to deny him the republican partys nomination but in the pos

This code removes the stop words from the test_case variable. First, it splits the text into words on whitespace. It then initializes an empty list named final_test, where the cleaned words are stored. The code loops over each word and checks whether it is empty, contains carriage return and newline characters (\r\n), or is found in the list of stop words. If any of these conditions is met, the word is skipped; otherwise it is appended to final_test. Finally, the code prints the list of cleaned words that remains after stop word removal.





In [None]:
#split the text per space
#print(test_case.split(' '))
#initialize empty list for finalization of stopword removal
final_test =[]
#remove all empty and \r\n
for word in test_case.split():
    #skip empty strings, words containing \r\n, and stopwords
    if word =='' or '\r\n' in word or word in stop_words:
        continue
    final_test.append(word)
print(final_test)

['donald', 'trump', 'endorsed', 'unabashedly', 'noninterventionist', 'approach', 'world', 'affairs', 'monday', 'daylong', 'tour', 'washington', 'casting', 'doubt', 'need', 'north', 'atlantic', 'treaty', 'organization', 'expressing', 'skepticism', 'muscular', 'us', 'military', 'presence', 'asia', 'foreign', 'policy', 'positions', 'outlined', 'meeting', 'editorial', 'board', 'washington', 'post', 'came', 'day', 'trump', 'set', 'aside', 'guerrilla', 'tactics', 'showman', 'bravado', 'powered', 'campaign', 'appear', 'wouldbe', 'presidential', 'nominee', 'explaining', 'policies', 'accepting', 'counsel', 'building', 'bridges', 'republican', 'elites', 'monday', 'night', 'trump', 'delivered', 'scripted', 'address', 'front', 'american', 'israel', 'public', 'affairs', 'committee', 'prompting', 'ovations', 'pledges', 'stand', 'israel', 'take', 'hard', 'line', 'peace', 'talks', 'palestinians', 'trumps', 'whirlwind', 'day', 'appearances', 'around', 'nations', 'capital', 'intended', 'part', 'head', '

The following line of code turns the list of cleaned words in final_test into a single space-separated string. The expression ' '.join(final_test) joins the cleaned words in the final_test list into readable text for further analysis and processing.


In [None]:
#lets try to join the final test
' '.join(final_test)

'donald trump endorsed unabashedly noninterventionist approach world affairs monday daylong tour washington casting doubt need north atlantic treaty organization expressing skepticism muscular us military presence asia foreign policy positions outlined meeting editorial board washington post came day trump set aside guerrilla tactics showman bravado powered campaign appear wouldbe presidential nominee explaining policies accepting counsel building bridges republican elites monday night trump delivered scripted address front american israel public affairs committee prompting ovations pledges stand israel take hard line peace talks palestinians trumps whirlwind day appearances around nations capital intended part head establishment push deny republican partys nomination post meeting billionaire mogul also made clear would beholden gops longheld orthodoxies hourlong discussion revealed five foreign policy advisers trump advocated light footprint world spite unrest middle east elsewhere sa

The next cell does the same thing as the previous one, but wraps the logic in a function so that stopwords can be removed from the whole dataset.

In [None]:
#make a function pipeline:
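def remove_stopwords(text, stop_words):
    #stop_words: collection of words to drop (named so it does not shadow the nltk stopwords module)
    #make a temporary list
    temp_fnl = []
    for word in text.split():
        #skip empty strings, words containing \r\n, and stopwords
        if word =='' or '\r\n' in word or word in stop_words:
            continue
        temp_fnl.append(word)
    return ' '.join(temp_fnl)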

The next cell applies the remove_stopwords function to the 'text' column of the entire dataset DataFrame. Using the apply method and a lambda, each entry of the 'text' column is passed through the function, which removes the words found in stop_words. The cleaned text is then assigned back to the 'text' column, and head(10) displays the first ten rows so that the result of the stop word removal can be confirmed.


In [None]:
#apply to the whole dataset
dataset['text'] = dataset['text'].apply(lambda x: remove_stopwords(x, stop_words))
dataset.head(10)

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,A whirlwind day in D.C. showcases Trump’s unor...,donald trump endorsed unabashedly nonintervent...,REAL
1,1,"In Baltimore's call for federal police probe, ...",justice department investigations adversarial ...,REAL
2,2,Trump Proudly Declares: Most Of The People I’v...,trump proudly declares people ive insulted des...,FAKE
3,3,Inside the Trump-Bush melodrama: Decades of te...,donald trump spent day january 2014 hobnobbing...,REAL
4,4,Shutdown clash to return in force by December,notable names include ray washburne commerce d...,REAL
5,5,Can the great nuclear war be prevented ?,current concerns n23 october 22th 2016 great n...,FAKE
6,6,US charges 61 with India-based scam involving ...,us charges 61 indiabased scam involving 15000 ...,FAKE
7,7,Democratic debate 2015: Hillary Clinton and Be...,watch first democratic presidential debate tue...,REAL
8,8,What DNC Donors REALLY think of African Americans,video really disturbing african americans amer...,FAKE
9,9,"What happens to delegates won by Rubio, other ...",marco rubio dropping republican presidential r...,REAL


For further inspection, the following cell prints one entry to check whether the stopwords have been removed from the dataset.

In [None]:
#For inspection
print(dataset['text'][150])

0 comments families united prayer thanksgiving day prayer alone comes mind think obamas america told leave god home pledge never crucial time look god nation divided torn apart selfish greed corrupt society leaders lead behind hide dark secrets right plain sightbecause walmart brought us thanksgiving commercial huge reminder good right reminder good foundation made grateful family prayer allcoming together 30second ad world series rates cost walmart 500000 features diversity americas families camaraderie service members gather pray enjoy bonds family commercial showing everyone praying eating great job walmart walmart deplorable jack deplorablejackl november 3 2016 final thoughts america lets come together take moment reflect truly thankful holiday season friends family chance spend time ones love walmart would like give thanks veterans active duty service members home oversees holiday season walmart invited spirit thanksgiving seconds video imagine everyday americans brought seconds t
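
Reading one entry by eye is a useful spot check, but the same question can be asked programmatically. The sketch below uses a hand-picked subset of the NLTK list printed earlier, hard-coded so that it runs without downloading the corpus. The helper names are hypothetical. It also illustrates one consequence of the pipeline order: because apostrophes were stripped before this step, a contraction such as "you're" has already become "youre", which no longer matches the stopword list.

```python
# Hand-picked subset of the NLTK English stopword list shown earlier (an assumption
# made so that the sketch is self-contained).
stop_set = {"the", "is", "and", "you", "you're", "it's", "its", "of"}

def leftover_stopwords(text, stop_set):
    """Return the words in text that are still stopwords."""
    return [word for word in text.split() if word in stop_set]

def remove_stopwords_sketch(text, stop_set):
    """Same idea as remove_stopwords in the notebook, using set lookups."""
    return " ".join(word for word in text.split() if word not in stop_set)

# Text as it looks after lowercasing and special-character removal:
# "you're" has already become "youre", and "it's" has become "its".
cleaned_input = "youre sure the report is its own proof"
result = remove_stopwords_sketch(cleaned_input, stop_set)

print(result)
print(leftover_stopwords(result, stop_set))
```

No listed stopword is left, yet "youre" survives. If that matters for the analysis, one option is to remove stopwords before special characters, or to add apostrophe-free variants to the list.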

# Tokenization

During text preparation, tokenization segments a text into tokens, each of which may be a word, a phrase, or a sentence. Tokenizing a sentence like "Data preprocessing is crucial for analysis" results in the tokens "Data", "preprocessing", "is", "crucial", "for", and "analysis". Breaking text down into smaller parts in this way lets algorithms analyze it more effectively in tasks such as word frequency counting, pattern finding, or building language models.
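
As a rough illustration of what a word tokenizer does beyond splitting on spaces, the sketch below compares `str.split()` with a small regex-based tokenizer. The function `simple_tokenize` is a hypothetical stand-in written for this example; it is not NLTK's `word_tokenize`, which handles many more cases.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

sentence = "Data preprocessing is crucial for analysis."

# Whitespace splitting keeps the full stop attached to the last word.
print(sentence.split())

# The regex tokenizer separates it into its own token.
print(simple_tokenize(sentence))
```

In this notebook the punctuation has already been removed by the time tokenization runs, so the two approaches would likely give very similar results on the cleaned text.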


---




This cell imports the Natural Language Toolkit package and downloads the Punkt tokenizer models with nltk.download('punkt'). Punkt is a pre-trained model for splitting text into sentences, and NLTK's word_tokenize function relies on it. Many text processing tasks require the text to be broken down into manageable parts, such as words or sentences, before further analysis or manipulation can take place.


In [None]:
import nltk
nltk.download('punkt')  # Download the necessary data for tokenization

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

The line below imports the word_tokenize function from the nltk.tokenize module. Word tokenization, which separates text into individual word tokens, is one of the most fundamental steps in Natural Language Processing. This function breaks text down into words so that more granular analysis and processing of the text data can be carried out.


In [None]:
from nltk.tokenize import word_tokenize

This code snippet applies the word_tokenize function to each entry of the 'text' column of the dataset DataFrame. Each text entry is processed by the apply method, which tokenizes it into words. The resulting tokens are stored in a new column called 'tokenized_text'. This step prepares the text data for further analysis, such as frequency or sentiment analysis, by breaking a continuous flow of text into distinct, manageable pieces: the words.


In [None]:
# Apply word_tokenize function to each article in the 'text' column
dataset['tokenized_text'] = dataset['text'].apply(word_tokenize)

This line of code displays the first ten rows of the 'tokenized_text' column of the dataset DataFrame. It shows the result of the tokenization applied to the 'text' column: every entry is now a list of tokens (individual words) from the original text. This makes it easy to inspect and validate the tokenized output for the first few entries in the dataset.


In [None]:
# Print the first 10 tokenized result
dataset['tokenized_text'].head(10)

Unnamed: 0,tokenized_text
0,"[donald, trump, endorsed, unabashedly, noninte..."
1,"[justice, department, investigations, adversar..."
2,"[trump, proudly, declares, people, ive, insult..."
3,"[donald, trump, spent, day, january, 2014, hob..."
4,"[notable, names, include, ray, washburne, comm..."
5,"[current, concerns, n23, october, 22th, 2016, ..."
6,"[us, charges, 61, indiabased, scam, involving,..."
7,"[watch, first, democratic, presidential, debat..."
8,"[video, really, disturbing, african, americans..."
9,"[marco, rubio, dropping, republican, president..."
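
Once each article is a list of tokens, tasks such as word frequency counting become straightforward. The following is a minimal sketch using `collections.Counter` on invented token lists that stand in for the 'tokenized_text' column.

```python
from collections import Counter
from itertools import chain

# Invented token lists standing in for dataset['tokenized_text'].
tokenized_rows = [
    ["trump", "endorsed", "policy"],
    ["policy", "debate", "trump", "trump"],
]

# Flatten all rows into one stream of tokens and count each distinct token.
word_counts = Counter(chain.from_iterable(tokenized_rows))

print(word_counts.most_common(2))
```

With the real data, the 'tokenized_text' column could be passed to `chain.from_iterable` in place of `tokenized_rows`, since iterating over a pandas Series yields its entries one by one.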
