<a href="https://colab.research.google.com/github/simodepth/Entities/blob/main/How_to_perform_tokenization_in_NLP_with_NLTK_and_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#How to Kick Off Entity Research with Tokenization in NLP


---

**Tokenization** is a text-processing method that splits a sentence into a list of smaller units called tokens, typically individual words. It is usually the first step in processing text data for Natural Language Processing (**NLP**).
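
To make the idea concrete, here is a minimal sketch of tokenization that relies only on Python's standard library. The regular expression and the `simple_tokenize` name are illustrative assumptions. NLTK's tokenizer, used later in this notebook, applies more elaborate rules.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens (hypothetical helper, not NLTK's algorithm)."""
    # \w+ matches runs of letters, digits and underscores; punctuation is skipped.
    return re.findall(r"\w+", text)

sentence = "How to Automate a Sitemap Audit with Python"
tokens = simple_tokenize(sentence)
print(tokens)
```

Each word of the sentence becomes one element of the returned list, which is the shape of output the rest of this notebook produces for every page.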

Before diving into more granular NLP and machine learning techniques, you need to break your text data into fragments. Many programming languages can handle this task, but Python is among the easiest to work with.

In this framework, we use Python’s Natural Language Toolkit (**NLTK**) to take text data from a Pandas data frame and return a tokenized list of words for each row.


#Requirements & Assumptions


---

- Run the script either on Google Colab or Jupyter Notebook
- If you run the script on Jupyter Notebook, make sure to run `!pip3 install nltk` first
- Run a crawl with Screaming Frog and export the internal_html report (the code below reads it from an Excel file) with the following columns:
 1. url
 2. title 
 3. description 
 4. H1 
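
Before running the rest of the notebook, it can help to confirm that the export really contains these columns. The sketch below uses a small in-memory data frame as a stand-in for the crawl export. `REQUIRED` and `missing_columns` are hypothetical names, and the case-insensitive comparison is an assumption made because exports may capitalise names such as `URL`.

```python
import pandas as pd

# Column names listed in the requirements, lower-cased for comparison.
REQUIRED = ["url", "title", "description", "h1"]

def missing_columns(frame, required=REQUIRED):
    """Return the required column names that the frame lacks (case-insensitive)."""
    present = {str(name).lower() for name in frame.columns}
    return [name for name in required if name not in present]

# Stand-in for the crawl export; a real export would be loaded from a file.
sample = pd.DataFrame(
    {"URL": ["https://example.com/"], "title": ["Home"], "description": ["Welcome"]}
)
print(missing_columns(sample))  # the H1 column is absent in this stand-in
```

An empty list means every required column was found.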

In [9]:
#@title Import the Packages
import pandas as pd 
import nltk 

In [10]:
#@title Import the Data
df = pd.read_excel('/content/interni_html.xlsx') #@param {type:"string"}
df.head()

| | URL | title | description | H1-2 |
|---|---|---|---|---|
| 0 | https://seodepths.com/ | SEO Depth - SEO News & Python scripts for SEO | SEO News & Python scripts for SEO | SEO Depth |
| 1 | https://seodepths.com/seo-news/google-pros-con... | Google Pros and Cons are not Rich Results but ... | Have you noticed a new Pros & Cons SERP Featur... | Google Pros and Cons are not Rich Results but ... |
| 2 | https://seodepths.com/python-for-seo/detect-go... | How to Detect Google Title Tag Rewriting with ... | 🐍 Is Google tweaking your Meta Tags on the SER... | How to Detect Google Title Tag Rewriting with ... |
| 3 | https://seodepths.com/python-for-seo/sitemap-a... | How to Automate a Sitemap Audit with Python - ... | 🐍 Learn how to automate a sitemap.xml audit fo... | How to Automate a Sitemap Audit with Python |
| 4 | https://seodepths.com/about/ | Simone De Palma - SEO Depth | Learn who is behind the scenes of the SEO Dept... | Simone De Palma |


In [11]:
#@title Concatenate the Text into a single column
df['text'] = df['title'] + ' ' + df['description']
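
A detail worth knowing about this step: in pandas, adding a string to a missing value yields a missing value, so a row with an empty description ends up with no text at all. The sketch below shows this on a made-up data frame, along with one possible alternative (an assumption, not part of the original workflow) that fills the gaps with empty strings before concatenating.

```python
import pandas as pd

# Made-up rows: the second one has no description.
demo = pd.DataFrame(
    {"title": ["Home", "About"], "description": ["Welcome page", None]}
)

# Plain concatenation propagates the missing value.
plain = demo["title"] + " " + demo["description"]
print(plain.isna().tolist())

# Alternative: fill the gaps first, then strip the leftover space.
filled = (demo["title"].fillna("") + " " + demo["description"].fillna("")).str.strip()
print(filled.tolist())
```

Whether to keep title-only rows or discard them depends on how much text you need per page.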


In [12]:
#@title Remove NaN values and cast to string
# Drop rows with missing text; dropna() on the column alone does not remove rows from df
df = df.dropna(subset=['text']).copy()
df['text'] = df['text'].astype(str)
df.head()

| | URL | title | description | H1-2 | text |
|---|---|---|---|---|---|
| 0 | https://seodepths.com/ | SEO Depth - SEO News & Python scripts for SEO | SEO News & Python scripts for SEO | SEO Depth | SEO Depth - SEO News & Python scripts for SEO ... |
| 1 | https://seodepths.com/seo-news/google-pros-con... | Google Pros and Cons are not Rich Results but ... | Have you noticed a new Pros & Cons SERP Featur... | Google Pros and Cons are not Rich Results but ... | Google Pros and Cons are not Rich Results but ... |
| 2 | https://seodepths.com/python-for-seo/detect-go... | How to Detect Google Title Tag Rewriting with ... | 🐍 Is Google tweaking your Meta Tags on the SER... | How to Detect Google Title Tag Rewriting with ... | How to Detect Google Title Tag Rewriting with ... |
| 3 | https://seodepths.com/python-for-seo/sitemap-a... | How to Automate a Sitemap Audit with Python - ... | 🐍 Learn how to automate a sitemap.xml audit fo... | How to Automate a Sitemap Audit with Python | How to Automate a Sitemap Audit with Python - ... |
| 4 | https://seodepths.com/about/ | Simone De Palma - SEO Depth | Learn who is behind the scenes of the SEO Dept... | Simone De Palma | Simone De Palma - SEO Depth Learn who is behin... |


In [13]:
#@title  Create a tokenizer using NLTK
nltk.download('punkt');  # recent NLTK releases may ask for 'punkt_tab' instead


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [14]:
def tokenize(text):
    """Tokenizes a text string and returns a list of alphabetic tokens.

    Args:
        text (str): Text from one row of a Pandas dataframe column (i.e. a value of df['text']).

    Returns:
        tokens (list): Tokenized list, i.e. ['Donald', 'Trump', 'tweets']
    """

    tokens = nltk.word_tokenize(text)
    # Keep purely alphabetic tokens; punctuation and numbers are dropped
    return [w for w in tokens if w.isalpha()]
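
The `isalpha()` filter in the function above keeps only tokens made up entirely of letters. The sketch below applies the same filter to a hand-written token list (an illustrative assumption, not actual NLTK output) to show what gets discarded.

```python
# Hand-written tokens standing in for tokenizer output.
tokens = ["SEO", "Depth", "-", "SEO", "News", "&", "Python", "sitemap.xml", "2023"]

# Same filter as in tokenize(): keep purely alphabetic tokens.
kept = [w for w in tokens if w.isalpha()]
dropped = [w for w in tokens if not w.isalpha()]

print(kept)
print(dropped)
```

Tokens that mix letters with digits or punctuation, such as file names, are removed along with plain punctuation, so loosen the filter if your entities include them.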

In [15]:
#@title Tokenize your text data using NLTK
df['tokenized'] = df['text'].apply(tokenize)
df[['tokenized']].head()

| | tokenized |
|---|---|
| 0 | [SEO, Depth, SEO, News, Python, scripts, for, ... |
| 1 | [Google, Pros, and, Cons, are, not, Rich, Resu... |
| 2 | [How, to, Detect, Google, Title, Tag, Rewritin... |
| 3 | [How, to, Automate, a, Sitemap, Audit, with, P... |
| 4 | [Simone, De, Palma, SEO, Depth, Learn, who, is... |
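
Once every row holds a list of tokens, a quick frequency count is one way to sanity-check the result. This is a sketch on hand-written token lists standing in for `df['tokenized']`. Lower-casing before counting is an assumption, not something the notebook itself does.

```python
from collections import Counter
from itertools import chain

# Stand-in for df['tokenized']: one token list per page (shortened by hand).
tokenized_rows = [
    ["SEO", "Depth", "SEO", "News", "Python", "scripts", "for", "SEO"],
    ["How", "to", "Automate", "a", "Sitemap", "Audit", "with", "Python"],
]

# Flatten the lists and count tokens case-insensitively.
counts = Counter(token.lower() for token in chain.from_iterable(tokenized_rows))
print(counts.most_common(2))
```

With the real data frame, the same expression should work by passing `df['tokenized']` in place of `tokenized_rows`.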


In [22]:
#@title Download your Data Frame
df.to_excel(r'iCloud Drive\Scrivania\tokenization.xlsx', index=False, header=True)