
# Extracting raw text data from a messy social media post

## Using NLTK

In [48]:
import string
import warnings
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time setup: uncomment to download the NLTK data used below.
# nltk.download('popular')
# nltk.download('punkt_tab')

warnings.filterwarnings("ignore")

In [49]:
def clean_text(text):
  """Lowercase, tokenize, filter, and lemmatize a raw post; returns a list of lemmas."""
  text = text.lower()
  tokens = word_tokenize(text)

  clean_tokens = []
  for token in tokens:
    # Skip mention/hashtag markers and dot runs such as "....".
    if token.startswith(("@", ".", "#")):
      continue
    # Skip URLs.
    if token.startswith(("http", "www")):
      continue
    # Skip punctuation tokens such as "," or "?".
    if token in string.punctuation:
      continue
    # Skip purely numeric tokens.
    if token.isdigit():
      continue
    clean_tokens.append(token)

  # Remove English stopwords.
  stop_words = set(stopwords.words('english'))
  filtered = [word for word in clean_tokens if word not in stop_words]

  # Lemmatize every token as a verb (pos='v'), so "turned" becomes "turn".
  lemmatizer = WordNetLemmatizer()
  lemmatized = [lemmatizer.lemmatize(word, pos='v') for word in filtered]

  return lemmatized


In [50]:
messy_text = """
5 min ago my reddit all turned spanish....all the tabs, preferences..etc. I went into preferences and made sure they were checked to english....they were....what is going on? I cant read spanish so I am in need of some help here....i am asking you b/c I cannot find the mod help link b/c I cannot read it.

UPDATE- ok- it must have something to do with firefox. And to all of you telling me how to change lang. preference, OF COURSE I TRIED THAT before I posted. On IE all is normal. On my desktop all is normal. On my netbook, using firefox, it is a taco show. I ran the page through google translator and I especially enjoy the rick roll. So anyone know how to un-spanish reddit in firefox? This is the only page it is happening on.

EDIT- I must admit this is hilarious. I wish i had paid more attention in spanish class....

UPDATE- So I wake up this morning to about 1500 replies in my inbox that I cannot read. And then I run them through Google translator and most of them say stuff like "the dog is in my pants" and "where is the library".

Thanks, reddit.

As far as the Spanish problem goes.. I disabled all my firefox extensions, cleared all my cookies and restarted it all again. THE SPANISH IS GONE! I do not know what possessed my computer to run for the border, but I am glad it is back. :)

"""

print(clean_text(messy_text))

['min', 'ago', 'reddit', 'turn', 'spanish', 'tabs', 'preferences', 'etc', 'go', 'preferences', 'make', 'sure', 'check', 'english', 'go', 'cant', 'read', 'spanish', 'need', 'help', 'ask', 'b/c', 'find', 'mod', 'help', 'link', 'b/c', 'read', 'update-', 'ok-', 'must', 'something', 'firefox', 'tell', 'change', 'lang', 'preference', 'course', 'try', 'post', 'ie', 'normal', 'desktop', 'normal', 'netbook', 'use', 'firefox', 'taco', 'show', 'run', 'page', 'google', 'translator', 'especially', 'enjoy', 'rick', 'roll', 'anyone', 'know', 'un-spanish', 'reddit', 'firefox', 'page', 'happen', 'edit-', 'must', 'admit', 'hilarious', 'wish', 'pay', 'attention', 'spanish', 'class', 'update-', 'wake', 'morning', 'reply', 'inbox', 'read', 'run', 'google', 'translator', 'say', 'stuff', 'like', '``', 'dog', 'pant', "''", '``', 'library', "''", 'thank', 'reddit', 'far', 'spanish', 'problem', 'go', 'disable', 'firefox', 'extensions', 'clear', 'cookies', 'restart', 'spanish', 'go', 'know', 'possess', 'computer
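
The output above still carries a few punctuation leftovers, such as 'update-', 'ok-', and the quote tokens '``' and "''". Below is a minimal sketch of one way to tidy them, assuming it runs as an extra pass over the list returned by `clean_text`. The helper `strip_residual_punctuation` is hypothetical and not part of the notebook.

```python
import string


def strip_residual_punctuation(tokens):
  """Hypothetical post-processing pass: trim punctuation from token edges."""
  cleaned = []
  for token in tokens:
    # Remove leading and trailing punctuation only; inner characters are kept.
    stripped = token.strip(string.punctuation)
    # Tokens made entirely of punctuation become empty and are dropped.
    if stripped:
      cleaned.append(stripped)
  return cleaned


sample = ['update-', 'ok-', '``', 'dog', 'pant', "''", 'b/c', 'un-spanish']
print(strip_residual_punctuation(sample))
```

Tokens like 'b/c' and 'un-spanish' pass through unchanged because only the edges of each token are stripped.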

#### Writing a function that returns the most frequent lemmas in the social media post above

In [51]:
def most_freq_lemmas(lemmatized_list, n=5):
  """Return the n most common lemmas as (lemma, count) pairs."""
  freq = Counter(lemmatized_list)
  return freq.most_common(n)

In [52]:
lemmas = clean_text(messy_text)
top_lemmas = most_freq_lemmas(lemmas)

print(top_lemmas)

[('spanish', 5), ('go', 4), ('firefox', 4), ('reddit', 3), ('read', 3)]
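
In the result above, 'go' and 'firefox' both occur four times. `Counter.most_common` lists elements with equal counts in the order they were first encountered, which is why 'go' is printed before 'firefox'. A small sketch on a hand-made list (the tokens are illustrative, not the full post):

```python
from collections import Counter

# Illustrative tokens: 'go' appears before 'firefox', and both occur twice.
tokens = ['spanish', 'go', 'firefox', 'go', 'firefox', 'spanish', 'spanish']

top_two = Counter(tokens).most_common(2)
print(top_two)  # [('spanish', 3), ('go', 2)]
```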


## Using spaCy

In [57]:
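import spacy

# Requires the small English model to be installed.
nlp = spacy.load("en_core_web_sm")

def clean_text_spacy(text):
  """Return lowercase lemmas of alphabetic, non-stopword tokens."""
  doc = nlp(text)
  clean_tokens = []
  for token in doc:
    # Keep only alphabetic tokens that are neither stopwords nor punctuation.
    if (
        not token.is_stop
        and not token.is_punct
        and token.is_alpha
    ):
      clean_tokens.append(token.lemma_.lower())
  return clean_tokens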

In [58]:
text = """
EA's infamous "pride and accomplishment" post that got them 668,000 downvotes, the most downvoted post in the history of Reddit. Funnily enough the post is also one of the highest awarded in the site, presumably because awards help bring visibility to the post which will make more people downvote it.

"""
print(clean_text_spacy(text))

['ea', 'infamous', 'pride', 'accomplishment', 'post', 'get', 'downvote', 'downvoted', 'post', 'history', 'reddit', 'funnily', 'post', 'high', 'award', 'site', 'presumably', 'award', 'help', 'bring', 'visibility', 'post', 'people', 'downvote']
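
The same frequency count can be applied to the spaCy lemmas. The sketch below copies the printed list above into a literal so that it runs without spaCy installed. The name `top_spacy_lemmas` is introduced here for illustration only.

```python
from collections import Counter

# Lemmas copied from the printed output of clean_text_spacy above.
spacy_lemmas = ['ea', 'infamous', 'pride', 'accomplishment', 'post', 'get',
                'downvote', 'downvoted', 'post', 'history', 'reddit', 'funnily',
                'post', 'high', 'award', 'site', 'presumably', 'award', 'help',
                'bring', 'visibility', 'post', 'people', 'downvote']

# Count each lemma and keep the three most frequent.
top_spacy_lemmas = Counter(spacy_lemmas).most_common(3)
print(top_spacy_lemmas)  # [('post', 4), ('downvote', 2), ('award', 2)]
```

Note that 'downvote' and 'downvoted' are counted separately, since the printed list contains both forms.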
