<a href="https://colab.research.google.com/github/shartazkhan/nlp_fundamentals/blob/main/NLP_Text_Preprocessing_2_on_custom_dataset.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# What we will do here


1.   Create a custom dataset from scratch.
2.   Apply data preprocessing to it.


We will create a movie dataset with **3** columns:

**1) Movie name**, **2) Description**, and **3) Genre**



In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# 1. Data scraping to create dataset

In [2]:
response = requests.get('https://api.themoviedb.org/3/discover/movie?api_key=8c687e4428d8187418aaf8d96c75de17&language=en-US&with_original_language=bn')

response

<Response [200]>

In [3]:
frames = []  # one small DataFrame per page
for page in range(1, 188):
    link = f'https://api.themoviedb.org/3/discover/movie?api_key=8c687e4428d8187418aaf8d96c75de17&language=en-US&with_original_language=bn&page={page}'
    response = requests.get(link)
    # .head() keeps only the first five results of each page
    temp_df = pd.DataFrame(response.json()['results']).head()[['id', 'title', 'overview', 'genre_ids', 'adult']]
    frames.append(temp_df)

# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames, ignore_index=True)
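
The loop above builds the dataset one page at a time. A minimal sketch of an alternative, assuming a hypothetical `fetch_page(page)` callable that returns the decoded JSON for one page (for example a thin wrapper around `requests.get(...).json()`): collect plain dictionaries in a list and build the DataFrame once at the end. `collect_results`, `fetch_page`, and `fake_fetch` are illustrative names, not part of the TMDB API.

```python
def collect_results(fetch_page, first_page, last_page,
                    columns=('id', 'title', 'overview', 'genre_ids', 'adult')):
    """Gather the selected fields from every page into one list of dicts."""
    rows = []
    for page in range(first_page, last_page + 1):
        payload = fetch_page(page)                # hypothetical: returns a dict
        for item in payload.get('results', []):   # a missing key yields no rows
            rows.append({col: item.get(col) for col in columns})
    return rows


# Stand-in for a real HTTP call so the sketch runs offline.
def fake_fetch(page):
    return {'results': [{'id': page, 'title': f'Movie {page}', 'overview': '',
                         'genre_ids': [18], 'adult': False}]}


rows = collect_results(fake_fetch, 1, 3)
print(len(rows), rows[0]['title'])  # 3 Movie 1
```

Passing the resulting list to `pd.DataFrame(rows)` would give a frame with the same five columns.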

In [None]:
df.sample(5)

In [None]:
if df is not None:
    df.to_csv('tmdb_bengali_content.csv', index=False)
    print("DataFrame saved to tmdb_bengali_content.csv")
else:
    print("No DataFrame to save.")

# 2. I call it pre-preprocessing

In [4]:
df = pd.read_csv('/content/drive/MyDrive/Datasets/Bangla Movie 2k imdb/bangla_movies_imdb_2k.csv')

In [5]:
df.sample(5)

Unnamed: 0,year,run_time,short_description,rating,rating_count,genre,name
1187,2000,2h 16m,A landlord and his son are untouchable by the ...,8.6,(15),drama,Golam
1214,1999,2h 30m,The second movie directed by Humayun Ahmed. A ...,8.6,(2.8K),drama,Srabon Megher Din
957,2009,2h 18m,,6.5,(24),drama,Chander Moto Bou
1590,1982,2h 24m,Two childhood sweethearts departs from each ot...,7.5,(27),drama,Devdas
889,2012,,A story of king and kingdom.,,,drama,Raja Surja Kha


In [6]:
len(df)

2000

In [7]:
df = df[['name', 'genre', 'short_description']]
display(df.head())

Unnamed: 0,name,genre,short_description
0,Abar Jaago,drama,A national star cricketer Hasan Ahsan gets ban...
1,Mrucha Chiasodpoi,documentary,In the remote hills of the Chittagong Hill Tra...
2,Borbaad,"action, romance, thriller","After a heartbreak by Nitu, Ariyan Mirza seeks..."
3,Taandob,"action, thriller",A jobless villager moves to Dhaka for work but...
4,Jongli,drama,"A troubled university student, haunted by fami..."


In [8]:
df['short_description'][49]

nan

In [9]:
# Reassign instead of using inplace=True: df is a slice of the original frame,
# and in-place edits on a slice trigger pandas' SettingWithCopyWarning.
df = df.dropna(subset=['short_description']).reset_index(drop=True)


In [10]:
len(df)

883

In [11]:
df['short_description'][49]

"Asif joins neighborhood cricket led by skilled batter Mahtab. They nickname a pretty woman 'Sania Mirza', sparking rumours of her being Mahtab's girlfriend. But when Asif meets her unexpectedly, a shocking truth comes to light."

# 3. Preprocessing (the real one)



1.   Convert data to lowercase
2.   Remove unnecessary things
3.   Chat word Treatment (if needed)
4.   Spell check
5.   Stopword removal
6.   Handling Emojis
7.   Tokenization
8.   Stemming / Lemmatization



## 1. Convert data to lowercase

In [12]:
df.loc[:, 'short_description'] = df['short_description'].str.lower()

display(df.sample(5))

Unnamed: 0,name,genre,short_description
587,Swopnodanay,drama,a poor villager finds some high priced foreign...
220,Rickshaw Girl,drama,a daring bangladeshi teenager attempts to help...
314,Buker Ba Pashe,drama,"in a bus journey, two people share two seats n..."
71,Mr. Engineer and Miss Doctor,romance,when engineer nihaal crosses paths with doctor...
192,Scooty,drama,"anu bahar, a girl from a middle class family, ..."


In [13]:
df['short_description'][49]

"asif joins neighborhood cricket led by skilled batter mahtab. they nickname a pretty woman 'sania mirza', sparking rumours of her being mahtab's girlfriend. but when asif meets her unexpectedly, a shocking truth comes to light."

## 2.   Remove unnecessary things

Remove HTML tags (if any)

In [14]:
import re

def remove_html_tags(text):
    clean = re.compile('<.*?>')
    return clean.sub(r'', str(text)) # Convert to string before applying regex

df.loc[:,'short_description'] = df['short_description'].astype(str).apply(remove_html_tags)
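
The descriptions in this dataset may not contain any tags, so the effect of the pattern is easiest to see on a made-up string. The sample text below is an assumption for illustration, not a row from the dataset.

```python
import re

TAG_PATTERN = re.compile(r'<.*?>')  # non-greedy: each match stops at the first '>'

def remove_html_tags(text):
    return TAG_PATTERN.sub('', str(text))

sample = '<p>a story of <b>king</b> and kingdom.</p>'
print(remove_html_tags(sample))  # a story of king and kingdom.
```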



In [15]:
df['short_description'][49]

"asif joins neighborhood cricket led by skilled batter mahtab. they nickname a pretty woman 'sania mirza', sparking rumours of her being mahtab's girlfriend. but when asif meets her unexpectedly, a shocking truth comes to light."

Remove URLs

In [16]:
def remove_url(text):
    clean = re.compile(r'https?://\S+|www\.\S+')  # raw string, so \S is read as a regex class
    return clean.sub(r'', text)

df.loc[:,'short_description'] = df['short_description'].apply(remove_url)

display(df.sample(5))

Unnamed: 0,name,genre,short_description
63,Tokhon Jokhon,drama,raha and tanim wanted to get married and both ...
381,Rokto,action,it is loosely based on the long kiss goodnight...
622,Teardrops of Karnaphuli,documentary,a documentary film on the plight of the indige...
121,Pori,,puja chery is portraying the central character...
820,Ononto Prem,drama,a young man who wants to help his friend in lo...


In [17]:
df['short_description'][43]

'three interconnected romantic stories unfold, delving into the intricacies of love, passion, and heartbreak.'
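
A quick check of the URL pattern on a made-up sentence (the sample text and `example.com` addresses are assumptions). Note that removing a URL leaves its surrounding spaces behind, so a whitespace cleanup afterwards can be useful.

```python
import re

URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def remove_url(text):
    return URL_PATTERN.sub('', text)

sample = 'watch the trailer at https://example.com/trailer or www.example.com today'
without_urls = remove_url(sample)
print(repr(without_urls))            # 'watch the trailer at  or  today'

# Collapse the double spaces left where the URLs used to be.
tidy = ' '.join(without_urls.split())
print(tidy)                          # watch the trailer at or today
```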

Remove punctuation

In [18]:
import string

def remove_punctuation(text):
    translator = str.maketrans('', '', string.punctuation)

    return text.translate(translator)

df.loc[:,'short_description'] = df['short_description'].apply(remove_punctuation)


display(df.sample(5))

Unnamed: 0,name,genre,short_description
366,Chitkini,drama,panchagarh rail station of north bangladesh is...
850,Sadharon Meye,drama,a story of a simple college going girl
793,Nazma,drama,a wifes love and integrity towards her husband...
54,Thikana Bodle Jay,romance,after completing his masters he is applying fo...
247,August 1975,history,the film portrays the initial events of the ra...


In [19]:
df['short_description'][49]

'asif joins neighborhood cricket led by skilled batter mahtab they nickname a pretty woman sania mirza sparking rumours of her being mahtabs girlfriend but when asif meets her unexpectedly a shocking truth comes to light'
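
`string.punctuation` lists only ASCII punctuation, so typographic characters such as curly quotes or en dashes pass through untouched. If the descriptions contain them, one option is to drop every character whose Unicode category starts with `P`. A minimal sketch; `remove_unicode_punctuation` is an illustrative name that is not used elsewhere in this notebook, and the sample string is made up.

```python
import string
import unicodedata

def remove_punctuation(text):
    return text.translate(str.maketrans('', '', string.punctuation))

def remove_unicode_punctuation(text):
    # Unicode general categories beginning with 'P' are punctuation.
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

sample = 'mahtab\u2019s girlfriend \u2013 a \u201cshocking\u201d truth!'
ascii_only = remove_punctuation(sample)
all_punct = remove_unicode_punctuation(sample)
print(ascii_only)   # only the final '!' is removed
print(all_punct)    # curly quotes and the dash are removed as well
```

Symbols such as `$` or `+` belong to category `S` rather than `P`, so the two functions complement each other and can be applied one after the other.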

## 3.   Chat word Treatment (if needed)


In [20]:
file_path = '/content/drive/MyDrive/Datasets/chat_words_dictionary.txt'

try:
    with open(file_path, 'r') as f:
        file_content = f.read()

        chat_words_dict = {}
        for line in file_content.strip().split('\n'):
            if '=' in line:
                abbr, full_form = line.split('=', 1)
                chat_words_dict[abbr.strip()] = full_form.strip()

        print("\nChat word dictionary created:")
        display(chat_words_dict)

except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")


Chat word dictionary created:


{'afaik': 'as far as i know',
 'afk': 'away from keyboard',
 'asap': 'as soon as possible',
 'atk': 'at the keyboard',
 'atm': 'at the moment',
 'a3': 'anytime anywhere anyplace',
 'bak': 'back at keyboard',
 'bbl': 'be back later',
 'bbs': 'be back soon',
 'bfn': 'bye for now',
 'brb': 'be right back',
 'brt': 'be right there',
 'btw': 'by the way',
 'b4': 'before',
 'cu': 'see you',
 'cul8r': 'see you later',
 'faq': 'frequently asked questions',
 'fc': 'fingers crossed',
 'fwiw': 'for what its worth',
 'fyi': 'for your information',
 'gal': 'get a life',
 'gg': 'good game',
 'gn': 'good night',
 'gday': 'good day',
 'gmta': 'great minds think alike',
 'gr8': 'great',
 'g9': 'genius',
 'iykyk': 'if you know you know',
 'ic': 'i see',
 'icq': 'i seek you also a chat program',
 'ilu': 'ilu i love you',
 'imho': 'in my honesthumble opinion',
 'imo': 'in my opinion',
 'iow': 'in other words',
 'irl': 'in real life',
 'kiss': 'keep it simple stupid',
 'ldr': 'long distance relationship',


We don't need this step here. It would be helpful if we were working with informal text such as reviews.
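
For a dataset where chat words do matter, the dictionary could be applied with a simple token lookup. This is a sketch under two assumptions: the text has already been lowercased and stripped of punctuation (as in the earlier steps), and `expand_chat_words` is a hypothetical helper name. The three entries are copied from the dictionary printed above so the example runs without the file.

```python
def expand_chat_words(text, chat_words):
    """Replace each whitespace-separated token that is a known abbreviation."""
    return ' '.join(chat_words.get(word, word) for word in text.split())

# Three entries copied from the dictionary printed above.
chat_words_dict = {
    'asap': 'as soon as possible',
    'btw': 'by the way',
    'gr8': 'great',
}

expanded = expand_chat_words('btw the movie was gr8', chat_words_dict)
print(expanded)  # by the way the movie was great
```

On a DataFrame column this would be used as `df['text'].apply(lambda t: expand_chat_words(t, chat_words_dict))`, where `text` is a placeholder column name.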

## 4.   Spell check


Spell checking is also unnecessary for this dataset; it would be more helpful for informal text such as reviews. The cell below applies it anyway to show how it works.

In [21]:
# !pip install autocorrect

from autocorrect import Speller

spell = Speller(lang='en')

df['short_description'] = df['short_description'].apply(spell)

display(df.sample(5))

Unnamed: 0,name,genre,short_description
473,Boishommo,drama,presents the young vibrant architecture scene ...
63,Tokhon Jokhon,drama,raja and anim wanted to get married and both o...
48,Gyani Goni,family,gone a village auto mechanic is known far and ...
97,Puff Daddy,drama,supernatural gifted fortune teller puff daddy ...
256,I'm Fine Fake Friends,biography,the story begins with a girl and a boy a girl ...
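
Comparing this sample with the earlier output for the same row shows a side effect: index 63 read "raha and tanim wanted to get married" before spell checking and "raja and anim wanted to get married" after it, so character names were rewritten. One possible guard, sketched below, is to skip tokens from a protected set. `correct_except`, `toy_corrector`, and `protected_names` are hypothetical names; the toy corrector only stands in for a real spell checker so the example runs on its own.

```python
def correct_except(text, corrector, protected):
    """Apply `corrector` to each token unless the token is in `protected`."""
    return ' '.join(word if word in protected else corrector(word)
                    for word in text.split())

# Toy corrector standing in for a real spell checker (illustration only).
toy_corrections = {'raha': 'raja', 'tanim': 'anim', 'wantd': 'wanted'}

def toy_corrector(word):
    return toy_corrections.get(word, word)

protected_names = {'raha', 'tanim'}
fixed = correct_except('raha and tanim wantd to get married',
                       toy_corrector, protected_names)
print(fixed)  # raha and tanim wanted to get married
```

Since `spell` is applied directly to strings in the cell above, it could presumably be passed as `corrector`; how the protected set is built (for example from a list of character names) is left open here.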


## 5.   Stopword removal


In [22]:
from nltk.corpus import stopwords
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [23]:
stopwords.words('english')

['a',
 'about',
 'above',
 'after',
 'again',
 'against',
 'ain',
 'all',
 'am',
 'an',
 'and',
 'any',
 'are',
 'aren',
 "aren't",
 'as',
 'at',
 'be',
 'because',
 'been',
 'before',
 'being',
 'below',
 'between',
 'both',
 'but',
 'by',
 'can',
 'couldn',
 "couldn't",
 'd',
 'did',
 'didn',
 "didn't",
 'do',
 'does',
 'doesn',
 "doesn't",
 'doing',
 'don',
 "don't",
 'down',
 'during',
 'each',
 'few',
 'for',
 'from',
 'further',
 'had',
 'hadn',
 "hadn't",
 'has',
 'hasn',
 "hasn't",
 'have',
 'haven',
 "haven't",
 'having',
 'he',
 "he'd",
 "he'll",
 'her',
 'here',
 'hers',
 'herself',
 "he's",
 'him',
 'himself',
 'his',
 'how',
 'i',
 "i'd",
 'if',
 "i'll",
 "i'm",
 'in',
 'into',
 'is',
 'isn',
 "isn't",
 'it',
 "it'd",
 "it'll",
 "it's",
 'its',
 'itself',
 "i've",
 'just',
 'll',
 'm',
 'ma',
 'me',
 'mightn',
 "mightn't",
 'more',
 'most',
 'mustn',
 "mustn't",
 'my',
 'myself',
 'needn',
 "needn't",
 'no',
 'nor',
 'not',
 'now',
 'o',
 'of',
 'off',
 'on',
 'once',
 'on

In [24]:
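# Build the set once; calling stopwords.words() for every word is slow
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Drop stopwords entirely instead of appending '', which leaves extra spaces
    return " ".join(word for word in text.split() if word not in stop_words)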

In [25]:
df['short_description'] = df['short_description'].apply(remove_stopwords)

display(df.sample(5))

Unnamed: 0,name,genre,short_description
97,Puff Daddy,drama,supernatural gifted fortune teller puff daddy ...
801,Emiler Goenda Bahini,comedy,boy widow went train trip dhaka receiv...
508,Uttarer Sur,drama,life story street singer little daughter ...
781,Ma o Chhele,drama,son takes revenge insult mother uncle ...
779,Dahan,drama,struggle lower middle class family militar...
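
The NLTK list printed above includes negations such as `no`, `nor`, and `not`. Removing them can change what a sentence says, which matters for some tasks. A minimal sketch that keeps them, assuming a small hard-coded subset of the printed list so the example runs without NLTK; `remove_stopwords_keep_negation`, `STOPWORDS_SUBSET`, and `KEEP` are illustrative names.

```python
# Subset of the NLTK English list printed above, hard-coded to run offline.
STOPWORDS_SUBSET = {'a', 'and', 'but', 'her', 'is', 'no', 'not', 'of'}
KEEP = {'no', 'nor', 'not'}  # negations we choose to keep (an assumption)

def remove_stopwords_keep_negation(text, stop_words=STOPWORDS_SUBSET, keep=KEEP):
    effective = set(stop_words) - set(keep)
    return ' '.join(word for word in text.split() if word not in effective)

result = remove_stopwords_keep_negation('a story of a girl but her father is not happy')
print(result)  # story girl father not happy
```

With the full list, `stop_words=set(stopwords.words('english'))` would take the place of the subset.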


## 6.   Handling Emojis


In [26]:
# Skipped: emoji handling is not needed for this dataset
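
For text that does contain emojis, one simple option is to strip them with a regular expression. The sketch below is an approximation: the ranges cover several common emoji blocks but are not exhaustive, and `remove_emojis` is a hypothetical helper name.

```python
import re

# Approximate ranges for common emoji blocks; not an exhaustive list.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001F5FF'   # symbols and pictographs
    '\U0001F600-\U0001F64F'   # emoticons
    '\U0001F680-\U0001F6FF'   # transport and map symbols
    '\U0001F900-\U0001F9FF'   # supplemental symbols and pictographs
    '\u2600-\u27BF'           # miscellaneous symbols and dingbats
    ']+'
)

def remove_emojis(text):
    return EMOJI_PATTERN.sub('', text)

cleaned = remove_emojis('great movie \U0001F600\U0001F44D')
print(repr(cleaned))  # 'great movie '
```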

## 7.   Tokenization


In [27]:
from nltk import word_tokenize
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [28]:
df['short_description'] = df['short_description'].apply(word_tokenize)

display(df.head())

Unnamed: 0,name,genre,short_description
0,Abar Jaago,drama,"[national, star, cricketer, hasan, hasan, gets..."
1,Mrucha Chiasodpoi,documentary,"[remote, hills, chittagong, hill, tracts, bang..."
2,Borbaad,"action, romance, thriller","[heartbreak, situ, asian, mira, seeks, revenge..."
3,Taandob,"action, thriller","[nobles, villager, moves, dhaka, work, gets, e..."
4,Jongli,drama,"[troubled, university, student, haunted, famil..."


## 8.   Stemming / Lemmatization

In [30]:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download('wordnet')

wordnet_lemmatizer = WordNetLemmatizer()

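def lemmatize_words(word_list):
    # lemmatize() treats every word as a noun by default,
    # so verb forms such as 'disappears' are left unchanged
    return [wordnet_lemmatizer.lemmatize(word) for word in word_list]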

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [32]:
df['short_description'] = df['short_description'].apply(lemmatize_words)

display(df.head())

Unnamed: 0,name,genre,short_description
0,Abar Jaago,drama,"[national, star, cricketer, hasan, hasan, get,..."
1,Mrucha Chiasodpoi,documentary,"[remote, hill, chittagong, hill, tract, bangla..."
2,Borbaad,"action, romance, thriller","[heartbreak, situ, asian, mira, seek, revenge,..."
3,Taandob,"action, thriller","[noble, villager, move, dhaka, work, get, enta..."
4,Jongli,drama,"[troubled, university, student, haunted, famil..."


In [33]:
df['short_description'][56]

['young',
 'girl',
 'mother',
 'go',
 'middle',
 'east',
 'migration',
 'work',
 'disappears',
 'confronted',
 'unendurable',
 'circumstance',
 'eviction',
 'overdue',
 'tuition',
 'grocery',
 'ultimatum',
 'force',
 'become',
 'house',
 'maid']
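
The string-level steps from section 3 (lowercasing, HTML tag removal, URL removal, punctuation removal) can also be chained into a single function and applied in one pass. A minimal sketch using the same patterns as the cells above; `clean_text` is an illustrative name and the sample sentence is made up.

```python
import re
import string

TAG_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(text):
    text = str(text).lower()
    text = TAG_PATTERN.sub('', text)
    text = URL_PATTERN.sub('', text)    # before punctuation removal, while '://' is intact
    text = text.translate(PUNCT_TABLE)
    return ' '.join(text.split())       # collapse leftover whitespace

cleaned = clean_text('<b>Asif</b> meets her unexpectedly, see www.example.com!')
print(cleaned)  # asif meets her unexpectedly see
```

The order matters: stripping punctuation first would break up the URLs and tags before their patterns could match them.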