GOAL
- Build a Python pipeline to:
 1. Clean text (remove punctuation, digits, and symbols)
 2. Tokenize and remove stopwords
 3. Save cleaned output for future use (CSV/JSON)

In [1]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... done
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11678 sha256=34d8eb9f58d58f8e37e76e633119954ff000efe1220926de6d25f091cf60c654
  Stored in directory: /root/.cache/pip/wheels/8f/ab/cb/45ccc40522d3a1c41e1d2ad53b8f33a62f394011ec38cd71c6
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [2]:
import wikipedia
data = wikipedia.page("Manchester United").content

In [3]:
data

'Manchester United Football Club, commonly referred to as Man United (often stylised as Man Utd) or simply United, is a professional football club based in Old Trafford, Greater Manchester, England. They compete in the Premier League, the top tier of English football. Nicknamed the Red Devils, they were founded as Newton Heath LYR Football Club in 1878, but changed their name to Manchester United in 1902. After a spell playing in Clayton, Manchester, the club moved to their current stadium, Old Trafford, in 1910.\nDomestically, Manchester United have won a joint-record twenty top-flight league titles, thirteen FA Cups, six League Cups and a record twenty-one FA Community Shields. Additionally, in international football, they have won the European Cup/UEFA Champions League three times, and the UEFA Europa League, the UEFA Cup Winners\' Cup, the UEFA Super Cup, the Intercontinental Cup and the FIFA Club World Cup once each. Appointed as manager in 1945, Matt Busby built a team with an av

In [4]:
import nltk
import re
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

In [5]:
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.


True

In [6]:
# Remove punctuation and digits.
# TODO: find a way to remove only the numbers that are not attached to surrounding characters.
text = data
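
One possible way to approach the note above about removing numbers, sketched under the assumption that the numbers to drop are digit runs standing alone as their own token (such as 1878), while digits attached to letters (such as u21 or 1950s) should stay. The helper name strip_standalone_numbers is hypothetical and is not part of the notebook.

```python
import re


def strip_standalone_numbers(text: str) -> str:
    """Remove digit-only tokens; keep digits that are attached to letters."""
    # \b\d+\b matches a run of digits with a word boundary on both sides,
    # so "1878" is removed while "u21" and "1950s" are left alone.
    without_numbers = re.sub(r"\b\d+\b", "", text)
    # Collapse the extra spaces left behind by the removals.
    return " ".join(without_numbers.split())


print(strip_standalone_numbers("founded 1878 under u21 rules in the 1950s"))
```

This is meant to run after punctuation has been stripped; on raw text, a decimal such as 3.5 would be split at the dot and both halves removed.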

In [7]:
class DataCleaner:
  """Lowercase, strip symbols, tokenize, drop stopwords, and lemmatize text."""

  # stopwords.words() with no language argument returns the lists for every
  # language in the corpus. Build the set once so each lookup is O(1)
  # instead of rebuilding the full list for every token.
  STOPWORDS = frozenset(stopwords.words())

  def __init__(self, text: str):
    self.text = text
    self.token = []

  def remove_punctuations(self) -> str:
    # Keep only letters, digits, and whitespace (digits are kept for now).
    self.text = re.sub(r'[^a-zA-Z0-9\s]', '', self.text.lower())
    return self.text

  def tokenizer(self, text: str) -> list[str]:
    self.token = nltk.word_tokenize(text)
    return self.token

  def remove_stopwords(self, tokens: list[str]) -> list[str]:
    self.token = [t for t in tokens if t not in self.STOPWORDS]
    return self.token

  def stemming(self, tokens: list[str]) -> list[str]:
    # Alternative to lemmatizer(); not used by cleaned_text().
    stemmer = PorterStemmer()
    self.token = [stemmer.stem(t) for t in tokens]
    return self.token

  def lemmatizer(self, tokens: list[str]) -> list[str]:
    lemma = WordNetLemmatizer()
    self.token = [lemma.lemmatize(t) for t in tokens]
    return self.token

  def cleaned_text(self) -> str:
    # Full pipeline: strip symbols -> tokenize -> drop stopwords -> lemmatize.
    self.text = self.remove_punctuations()
    self.token = self.tokenizer(self.text)
    self.token = self.remove_stopwords(self.token)
    self.token = self.lemmatizer(self.token)

    return " ".join(self.token)
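
Calling stopwords.words() without a language returns the stopword lists for every language in the NLTK corpus, which can drop more words than an English-only list would. A minimal sketch of making that choice explicit by passing the stopword set in as a parameter. It uses only the standard library: the clean function, the DEMO_STOPWORDS set, and the whitespace split (standing in for nltk.word_tokenize) are assumptions for illustration, not the notebook's implementation.

```python
import re

# Hypothetical, deliberately tiny stopword set used only for this illustration.
DEMO_STOPWORDS = frozenset({"i", "me", "my", "is", "in", "you", "will", "here"})


def clean(text: str, stop_words: frozenset[str] = DEMO_STOPWORDS) -> str:
    """Lowercase, strip non-alphanumerics, split on whitespace, drop stopwords."""
    normalized = re.sub(r"[^a-z0-9\s]", "", text.lower())
    tokens = normalized.split()  # stand-in for nltk.word_tokenize
    return " ".join(t for t in tokens if t not in stop_words)


print(clean("my boss is bullying me..."))
print(clean("my boss is bullying me...", stop_words=frozenset()))
```

With NLTK available, the same parameter could be given frozenset(stopwords.words('english')) to restrict filtering to English.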

In [8]:
cleaner = DataCleaner(text)
text = cleaner.cleaned_text()
text

'manchester united football club commonly referred united stylised utd simply united professional football club based trafford greater manchester england compete premier league top tier english football nicknamed red devil founded newton heath lyr football club 1878 changed manchester united 1902 spell playing clayton manchester club moved current stadium trafford 1910 domestically manchester united jointrecord twenty topflight league title thirteen cup league cup record twentyone community shield additionally international football european cupuefa champion league time uefa europa league uefa cup winner cup uefa super cup intercontinental cup fifa club world cup appointed manager 1945 matt busby built team average age 22 nicknamed busby babe successive league title 1950s english club compete european cup player killed munich air disaster busby rebuilt team star player george denis law bobby charlton united trinity league title english club win european cup 1968 busby retirement manche

In [10]:
# prompt: download data from link

!pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


In [11]:
import opendatasets as od
od.download('https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset')

Dataset URL: https://www.kaggle.com/datasets/abhi8923shriv/sentiment-analysis-dataset
Downloading sentiment-analysis-dataset.zip to ./sentiment-analysis-dataset



In [12]:
data = pd.read_csv('/content/sentiment-analysis-dataset/train.csv', encoding='latin')
data.head()

Unnamed: 0,textID,text,selected_text,sentiment,Time of Tweet,Age of User,Country,Population -2020,Land Area (Km²),Density (P/Km²)
0,cb774db0d1,"I`d have responded, if I were going","I`d have responded, if I were going",neutral,morning,0-20,Afghanistan,38928346,652860.0,60
1,549e992a42,Sooo SAD I will miss you here in San Diego!!!,Sooo SAD,negative,noon,21-30,Albania,2877797,27400.0,105
2,088c60f138,my boss is bullying me...,bullying me,negative,night,31-45,Algeria,43851044,2381740.0,18
3,9642c003ef,what interview! leave me alone,leave me alone,negative,morning,46-60,Andorra,77265,470.0,164
4,358bd9e861,"Sons of ****, why couldn`t they put them on t...","Sons of ****,",negative,noon,60-70,Angola,32866272,1246700.0,26


In [48]:
text_df = pd.DataFrame(data['text'].head(20),columns=['text'])
text_df.head()

Unnamed: 0,text
0,"I`d have responded, if I were going"
1,Sooo SAD I will miss you here in San Diego!!!
2,my boss is bullying me...
3,what interview! leave me alone
4,"Sons of ****, why couldn`t they put them on t..."


In [49]:
text_df['cleaned_text'] = text_df.text.apply(lambda x: DataCleaner(str(x)).cleaned_text())
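
If the column ever contains missing values, str(x) turns a NaN into the literal string "nan", which then passes through cleaning as a token. A small sketch of one way to guard against that, assuming an empty string is an acceptable placeholder for missing text. The safe_clean helper is hypothetical, and str.lower stands in for the real cleaning function.

```python
from typing import Callable


def safe_clean(value: object, clean_fn: Callable[[str], str]) -> str:
    """Apply clean_fn to strings; map anything else (None, NaN, numbers) to ''."""
    if not isinstance(value, str):
        return ""
    return clean_fn(value)


# str.lower stands in for the real cleaning function in this illustration.
print(safe_clean("Sooo SAD", str.lower))
print(repr(safe_clean(float("nan"), str.lower)))
```

In the notebook this could be applied as text_df['text'].apply(lambda x: safe_clean(x, lambda s: DataCleaner(s).cleaned_text())).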

In [50]:
text_df.head()

Unnamed: 0,text,cleaned_text
0,"I`d have responded, if I were going",id responded
1,Sooo SAD I will miss you here in San Diego!!!,sooo sad miss san diego
2,my boss is bullying me...,bos bullying
3,what interview! leave me alone,interview leave
4,"Sons of ****, why couldn`t they put them on t...",put release bought


In [51]:
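# index=False keeps the row index out of the file, so reloading it
# does not produce an extra "Unnamed: 0" column.
text_df.to_csv('cleaned_text.csv', index=False)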
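
The goal also lists JSON as an output format. A minimal sketch of saving and reloading the cleaned records with the standard library; the rows, the save_json and load_json helpers, and the cleaned_text.json file name are assumptions for illustration.

```python
import json
import os
import tempfile

# Hypothetical rows mirroring the text / cleaned_text columns above.
rows = [
    {"text": "my boss is bullying me...", "cleaned_text": "bos bullying"},
    {"text": "what interview! leave me alone", "cleaned_text": "interview leave"},
]


def save_json(records: list[dict], path: str) -> None:
    """Write records as a JSON array, keeping non-ASCII characters readable."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(records, fh, ensure_ascii=False, indent=2)


def load_json(path: str) -> list[dict]:
    """Read the records back for later use."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "cleaned_text.json")
    save_json(rows, out_path)
    reloaded = load_json(out_path)

print(reloaded == rows)
```

For the DataFrame itself, pandas offers text_df.to_json('cleaned_text.json', orient='records'), which writes the same list-of-records layout.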