# Introduction to NLP


A notebook for the concepts presented in [A Gentle Introduction to Natural Language Processing](https://towardsdatascience.com/a-gentle-introduction-to-natural-language-processing-e716ed3c0863).


## Basic Lib Setup

In [5]:
!conda install --yes nltk
!conda install --yes numpy
!conda install --yes scikit-learn
!conda install --yes pandas

import re
import nltk
import numpy as np
import pandas as pd

nltk.download('punkt')
nltk.download('stopwords')

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.



[nltk_data] Downloading package punkt to /home/jovyan/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

## Running Example

In [3]:
# Sample app description from the Google Play Store

description = "<p> Instagram (from Facebook) allows you to create and share your photos, stories, and videos with the friends and followers you care about. Connect with friends, share what you're up to, or see what's new from others all over the world. Explore our community where you can feel free to be yourself and share everything from your daily moments to life's highlights.</p>"

print(description)

<p> Instagram (from Facebook) allows you to create and share your photos, stories, and videos with the friends and followers you care about. Connect with friends, share what you're up to, or see what's new from others all over the world. Explore our community where you can feel free to be yourself and share everything from your daily moments to life's highlights.</p>


## Data Normalization

Example: removing HTML tags and parenthesized text

In [6]:
cleaned_description = re.sub(r'<.*?>', "", description)  # drop HTML tags (non-greedy match)
cleaned_description = re.sub(r'\(.*?\)', "", cleaned_description)  # drop parenthesized asides
cleaned_description = cleaned_description.strip()
print(cleaned_description)

Instagram  allows you to create and share your photos, stories, and videos with the friends and followers you care about. Connect with friends, share what you're up to, or see what's new from others all over the world. Explore our community where you can feel free to be yourself and share everything from your daily moments to life's highlights.
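The output above still contains a double space where "(from Facebook)" was removed. The following is a minimal sketch of one way to collapse it, assuming runs of whitespace carry no meaning in this text. The names `raw` and `normalized` are illustrative and are not used elsewhere in this notebook.

```python
import re

# Shortened, hypothetical version of the running example.
raw = "<p> Instagram (from Facebook) allows you to create and share.</p>"

no_tags = re.sub(r"<.*?>", "", raw)            # strip HTML tags
no_parens = re.sub(r"\(.*?\)", "", no_tags)    # strip parenthesized text
# Collapse any run of whitespace into a single space, then trim the ends.
normalized = re.sub(r"\s+", " ", no_parens).strip()

print(normalized)
```

Whether to collapse whitespace at this stage is a judgment call: `word_tokenize` ignores the extra space, but the whitespace-based `split()` used later for n-grams also tolerates it, so this step mainly matters when the cleaned string itself is displayed or stored.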


## Data Normalization

Example: converting to lowercase

In [7]:
cleaned_description = cleaned_description.lower()
print(cleaned_description)

instagram  allows you to create and share your photos, stories, and videos with the friends and followers you care about. connect with friends, share what you're up to, or see what's new from others all over the world. explore our community where you can feel free to be yourself and share everything from your daily moments to life's highlights.


## Tokenization



In [8]:
from nltk.tokenize import word_tokenize

tokens = word_tokenize(cleaned_description)

print(cleaned_description)
print(tokens)

instagram  allows you to create and share your photos, stories, and videos with the friends and followers you care about. connect with friends, share what you're up to, or see what's new from others all over the world. explore our community where you can feel free to be yourself and share everything from your daily moments to life's highlights.
['instagram', 'allows', 'you', 'to', 'create', 'and', 'share', 'your', 'photos', ',', 'stories', ',', 'and', 'videos', 'with', 'the', 'friends', 'and', 'followers', 'you', 'care', 'about', '.', 'connect', 'with', 'friends', ',', 'share', 'what', 'you', "'re", 'up', 'to', ',', 'or', 'see', 'what', "'s", 'new', 'from', 'others', 'all', 'over', 'the', 'world', '.', 'explore', 'our', 'community', 'where', 'you', 'can', 'feel', 'free', 'to', 'be', 'yourself', 'and', 'share', 'everything', 'from', 'your', 'daily', 'moments', 'to', 'life', "'s", 'highlights', '.']


## Stop Word Removal

In [9]:
from nltk.corpus import stopwords

stop_words = stopwords.words('english')
filtered_description = [word for word in tokens if word not in stop_words]
print(filtered_description)

['instagram', 'allows', 'create', 'share', 'photos', ',', 'stories', ',', 'videos', 'friends', 'followers', 'care', '.', 'connect', 'friends', ',', 'share', "'re", ',', 'see', "'s", 'new', 'others', 'world', '.', 'explore', 'community', 'feel', 'free', 'share', 'everything', 'daily', 'moments', 'life', "'s", 'highlights', '.']
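Punctuation marks and contraction fragments such as `'re` and `'s` survive stop word removal, as the output above shows. The sketch below is one possible way to drop them, assuming only purely alphabetic tokens are wanted. The list `filtered` is a shortened, hand-copied sample of the output above, and `words_only` is an illustrative name.

```python
# A shortened sample of the filtered tokens shown above.
filtered = ['instagram', 'allows', ',', 'share', "'re", 'see', "'s", 'world', '.']

# str.isalpha() is True only when every character is a letter,
# so punctuation and apostrophe fragments are discarded.
words_only = [tok for tok in filtered if tok.isalpha()]

print(words_only)
```

Note that `isalpha()` also rejects tokens containing digits or hyphens (for example `covid-19`), so a looser filter may be preferable for other corpora.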


## Stemming

In [7]:
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stemmed_description = [stemmer.stem(word) for word in filtered_description]
print(stemmed_description)

['instagram', 'allow', 'creat', 'share', 'photo', ',', 'stori', ',', 'video', 'friend', 'follow', 'care', '.', 'connect', 'friend', ',', 'share', "'re", ',', 'see', "'s", 'new', 'other', 'world', '.', 'explor', 'commun', 'feel', 'free', 'share', 'everyth', 'daili', 'moment', 'life', "'s", 'highlight', '.']


## Lemmatization

In [8]:
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()
lemma_description = [lemmatizer.lemmatize(word) for word in filtered_description]
print(lemma_description)

[nltk_data] Downloading package wordnet to /home/jovyan/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


['instagram', 'allows', 'create', 'share', 'photo', ',', 'story', ',', 'video', 'friend', 'follower', 'care', '.', 'connect', 'friend', ',', 'share', "'re", ',', 'see', "'s", 'new', 'others', 'world', '.', 'explore', 'community', 'feel', 'free', 'share', 'everything', 'daily', 'moment', 'life', "'s", 'highlight', '.']


## N-grams

In [9]:
from nltk import ngrams

n = 3
trigrams = list(ngrams(cleaned_description.split(), n))
print(trigrams)


[('instagram', 'allows', 'you'), ('allows', 'you', 'to'), ('you', 'to', 'create'), ('to', 'create', 'and'), ('create', 'and', 'share'), ('and', 'share', 'your'), ('share', 'your', 'photos,'), ('your', 'photos,', 'stories,'), ('photos,', 'stories,', 'and'), ('stories,', 'and', 'videos'), ('and', 'videos', 'with'), ('videos', 'with', 'the'), ('with', 'the', 'friends'), ('the', 'friends', 'and'), ('friends', 'and', 'followers'), ('and', 'followers', 'you'), ('followers', 'you', 'care'), ('you', 'care', 'about.'), ('care', 'about.', 'connect'), ('about.', 'connect', 'with'), ('connect', 'with', 'friends,'), ('with', 'friends,', 'share'), ('friends,', 'share', 'what'), ('share', 'what', "you're"), ('what', "you're", 'up'), ("you're", 'up', 'to,'), ('up', 'to,', 'or'), ('to,', 'or', 'see'), ('or', 'see', "what's"), ('see', "what's", 'new'), ("what's", 'new', 'from'), ('new', 'from', 'others'), ('from', 'others', 'all'), ('others', 'all', 'over'), ('all', 'over', 'the'), ('over', 'the', 'worl
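To make the sliding-window idea behind n-grams explicit, here is a minimal pure-Python sketch. The helper `make_ngrams` is hypothetical (it is not part of NLTK); for a plain list of tokens it is intended to yield the same kind of tuples as the output above.

```python
def make_ngrams(tokens, n):
    """Return all contiguous n-token windows as tuples."""
    # tokens[i:] shifts the list by i positions; zip stops at the
    # shortest shifted copy, which yields exactly the full windows.
    return list(zip(*(tokens[i:] for i in range(n))))

words = "instagram allows you to create".split()
trigrams = make_ngrams(words, 3)

print(trigrams)
```

A sequence of `k` tokens produces `k - n + 1` n-grams, and an empty list when `k < n`.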

## Bag of Words

In [10]:
print(tokens)

bag_words = np.unique(np.array(tokens)).tolist()
print('\n')
print(bag_words)

['instagram', 'allows', 'you', 'to', 'create', 'and', 'share', 'your', 'photos', ',', 'stories', ',', 'and', 'videos', 'with', 'the', 'friends', 'and', 'followers', 'you', 'care', 'about', '.', 'connect', 'with', 'friends', ',', 'share', 'what', 'you', "'re", 'up', 'to', ',', 'or', 'see', 'what', "'s", 'new', 'from', 'others', 'all', 'over', 'the', 'world', '.', 'explore', 'our', 'community', 'where', 'you', 'can', 'feel', 'free', 'to', 'be', 'yourself', 'and', 'share', 'everything', 'from', 'your', 'daily', 'moments', 'to', 'life', "'s", 'highlights', '.']


["'re", "'s", ',', '.', 'about', 'all', 'allows', 'and', 'be', 'can', 'care', 'community', 'connect', 'create', 'daily', 'everything', 'explore', 'feel', 'followers', 'free', 'friends', 'from', 'highlights', 'instagram', 'life', 'moments', 'new', 'or', 'others', 'our', 'over', 'photos', 'see', 'share', 'stories', 'the', 'to', 'up', 'videos', 'what', 'where', 'with', 'world', 'you', 'your', 'yourself']
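The cell above lists only the unique tokens, that is, the vocabulary. A bag-of-words representation usually also records how often each vocabulary term occurs. The sketch below shows one way to do that with the standard library; `sample_tokens`, `vocabulary`, and `vector` are illustrative names, and the token list is a made-up short sample rather than the full description.

```python
from collections import Counter

# Hypothetical short token list standing in for `tokens`.
sample_tokens = ['share', 'your', 'photos', 'and', 'share', 'your', 'stories']

counts = Counter(sample_tokens)        # term -> number of occurrences
vocabulary = sorted(counts)            # fixed, alphabetical term order
vector = [counts[term] for term in vocabulary]

print(vocabulary)
print(vector)
```

Fixing the vocabulary order is what lets several documents be compared as vectors of equal length.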


## TF-IDF

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer

Document1 = "Rehman is a software engineering researcher"
Document2 = "Akond is a researcher"
Document3 = "Akond is also a professor"
Doc = [Document1, Document2, Document3]
print(Doc)
print("\n")

vectorizer = TfidfVectorizer(use_idf=True)
tfIdf = vectorizer.fit_transform(Doc)
# Show the TF-IDF weights of the first document only (row 0), one row per vocabulary term.
df = pd.DataFrame(tfIdf[0].T.todense(), index=vectorizer.get_feature_names_out(), columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=True)
print(df.head(50))

['Rehman is a software engineering researcher', 'Akond is a researcher', 'Akond is also a professor']


               TF-IDF
akond        0.000000
also         0.000000
professor    0.000000
is           0.298032
researcher   0.383770
engineering  0.504611
rehman       0.504611
software     0.504611


## Stanford Typed Dependencies

https://corenlp.run/