# Keyword Extractor

I have been using the NLTK module of Python to extract keywords from the reviews.csv file of the AirBnB public dataset.

Apart from using a pandas DataFrame for reading and writing the CSV file, I carry out basic text cleaning such as:
* Tokenization
* Stopwords Removal
* Lemmatization
* POS tagging

I have used POS tagging for keyword extraction. Generally adjectives are used for sentiment analysis, but here I have picked adjectives and adverbs from the comments field of each review, so that this application can be easily extended for sentiment analysis.

# Tokenizing

* It refers to splitting a body of text into sentence tokens or word tokens.
* It is an essential part of NLP, as many modules work better (or only) with tokens. For example, pos_tag needs a list of tokens as input, not a raw string, to tag them by part of speech.

## Token

A token is each "entity" that is a part of whatever was split up based on rules. For example, each word is a token when a sentence is "tokenized" into words. Each sentence can also be a token, if you tokenize the sentences out of a paragraph.
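
As a rough illustration of the two levels of tokens, the sketch below splits a made-up review into sentence tokens and then word tokens using plain regular expressions. It is only a stand-in for the idea: `split_sentences` and `split_words` are hypothetical helpers, and NLTK's `sent_tokenize` and `word_tokenize` apply far more careful rules.

```python
import re

def split_sentences(text):
    # Naive rule (assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence):
    # Naive rule (assumption): a word is a run of letters or apostrophes.
    return re.findall(r"[A-Za-z']+", sentence)

# Made-up review text, not taken from the dataset.
text = "The hosts were very friendly. The room was clean!"

sentences = split_sentences(text)            # sentence tokens
words = [split_words(s) for s in sentences]  # word tokens per sentence

print(sentences)
print(words)
```

Each element of `sentences` is a sentence token, and each inner list of `words` holds the word tokens of one sentence.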

## Stop Words

* Stop words are words that are mostly used as fillers and carry little useful meaning.
* We should keep these words from taking up space in the database or valuable processing time.
* We can easily make a list of words to be used as stop words and then filter these words out of the data we want to process.
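
The last point can be sketched with a hand-made list. The words in `custom_stop_words` and the helper `remove_stop_words` are assumptions chosen for illustration; the notebook itself uses NLTK's much longer English list.

```python
# A tiny hand-made stop word list (an assumption for illustration;
# NLTK ships a much longer one via stopwords.words("english")).
custom_stop_words = {"the", "a", "an", "and", "was", "were", "very", "is"}

def remove_stop_words(tokens, stop_words):
    """Return the tokens that are not in stop_words, compared in lowercase."""
    return [t for t in tokens if t.lower() not in stop_words]

# Made-up tokens, not taken from the dataset.
tokens = ["The", "hosts", "were", "very", "friendly", "and", "helpful"]
filtered = remove_stop_words(tokens, custom_stop_words)
print(filtered)
```

Storing the stop words in a set keeps each membership check cheap, which matters when filtering many reviews.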

# POS Tagging

* Part of speech tagging creates tuples of words and parts of speech. 
* It labels words in a sentence as nouns, adjectives, verbs, etc. It can also label by tense, and more. 
* These tags mean whatever they meant in your original training data. 
* You are free to invent your own tags in your training data, as long as you are consistent in their usage.
* Training data generally takes a lot of work to create, so a pre-existing corpus is typically used. 
    * These usually use the [Penn Treebank](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html) and [Brown Corpus](http://www.scs.leeds.ac.uk/ccalas/tagsets/brown.html) tags.
    * Penn Treebank is probably the most common, but both corpora are available with NLTK.
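
To make the tuple shape concrete, here is a minimal sketch that filters hand-written `(word, tag)` pairs laid out the way `nltk.pos_tag` returns them. The tags were assigned by hand rather than produced by a tagger, and `pick_keywords` is a hypothetical helper.

```python
# Hand-written (word, tag) tuples in the shape nltk.pos_tag returns.
# The Penn Treebank tags below are chosen by hand, not real tagger output.
tagged = [
    ("host", "NN"),
    ("really", "RB"),
    ("friendly", "JJ"),
    ("cleaner", "JJR"),
    ("best", "JJS"),
    ("stayed", "VBD"),
]

def pick_keywords(tagged_tokens, prefixes=("JJ", "RB")):
    """Keep words whose tag starts with one of the given prefixes."""
    return [word for word, tag in tagged_tokens if tag.startswith(prefixes)]

keywords = pick_keywords(tagged)
# For comparison: matching the tags exactly skips JJR and JJS.
exact_only = [word for word, tag in tagged if tag in ("JJ", "RB")]

print(keywords)
print(exact_only)
```

Matching on the prefix also picks up comparative and superlative forms (JJR, JJS, RBR, RBS), whereas an exact comparison with "JJ" or "RB" leaves them out. Which behaviour is preferable depends on the keywords you want.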

# Lemmatizing

* Similar to stemming, where a word is replaced by its root.
* The major difference is that stemming can often create non-existent words, whereas lemmas are actual words with the same meaning. In that way, it's more akin to synonym replacement.
* A lemma is a root word, as opposed to the root stem.
* Sometimes it may wind up with a completely different word.
* Lemmatize takes a part of speech parameter, "pos". If not supplied, the default is "noun".
* This means that an attempt will be made to find the closest noun, which can create trouble.
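
One way to avoid the noun default is to pass the part of speech explicitly. The sketch below maps Penn Treebank tags to the one-letter codes WordNet uses ("n", "v", "a", "r"). The helper name `penn_to_wordnet_pos` is hypothetical, and the mapping is a common simplification rather than something this notebook does.

```python
def penn_to_wordnet_pos(penn_tag):
    """Map a Penn Treebank tag to the one-letter pos code WordNet expects.

    Falls back to "n" (noun), which is also the lemmatizer's default.
    """
    if penn_tag.startswith("JJ"):
        return "a"  # adjective
    if penn_tag.startswith("VB"):
        return "v"  # verb
    if penn_tag.startswith("RB"):
        return "r"  # adverb
    return "n"      # noun and everything else

# Intended use with NLTK (not run here, it needs the wordnet corpus):
#   lemmatiser.lemmatize(word, pos=penn_to_wordnet_pos(tag))
for tag in ("VBD", "JJS", "RB", "NNS"):
    print(tag, "->", penn_to_wordnet_pos(tag))
```

With such a mapping, the POS tagging step would have to run before lemmatizing, so that each word's tag is known when it is lemmatized.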

Make sure that you have unzipped the wordnet __corpus__ in 

    nltk_data/corpora/wordnet  
    
This will allow the __WordNetLemmatizer__ class to access WordNet.

* The __WordNetLemmatizer__ class is a thin wrapper around the __wordnet corpus__ and uses the __morphy()__ function of the __WordNetCorpusReader__ class to find a lemma.
* If no lemma is found, or the word itself is a lemma, the word is returned as is. 

In [1]:
import pandas as pd
import os
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

    os.getcwd()

    os.chdir("path_to_directory")

In [2]:
df = pd.read_csv("reviews_subset.csv", sep = ',')

In [3]:
df['key_words'] = df['comments']

In [4]:
stop_words = set(stopwords.words("english"))

In [5]:
ps = PorterStemmer()

In [6]:
stemmer = PorterStemmer()
lemmatiser = WordNetLemmatizer()

In [7]:
pd.options.mode.chained_assignment = None  # default='warn'

In [8]:
# Tokenizing text
for t in range(len(df)):
    if pd.notnull(df.key_words[t]):
        x = df.key_words[t]
        words = word_tokenize(x)
        # Keep alphabetic tokens only, lowercased
        words = [word.lower() for word in words if word.isalpha()]

        # Removing filler words
        filtered_sentence = [w for w in words if w not in stop_words]

        # Lemmatizing filtered words (no pos given, so each word is treated as a noun)
        custom_lemmas = []
        for w in filtered_sentence:
            # Each word is tokenized again, as in the original notebook
            for token in word_tokenize(w):
                custom_lemmas.append(lemmatiser.lemmatize(token))

        tag = nltk.pos_tag(custom_lemmas)

        # Filtering keywords out of lemmas according to POS tagging:
        # keep adjectives (JJ) and adverbs (RB)
        keyword = [word for word, pos in tag if pos in ("JJ", "RB")]

        # Write back with .loc to avoid chained assignment
        df.loc[t, 'key_words'] = ', '.join(keyword)
df.to_csv("subset_output.csv")

In [9]:
print(df[["comments","key_words"]])

                                              comments  \
0    Very welcoming and pretty house, Kryshana and ...   
1    Kryshana and Mike were excellent hosts. They a...   
2    Exactamente lo que se ve en el anuncio todo en...   
3    Kryshana and Michael  were very hospitable. Th...   
4    Kryshana and Mike are very nice.. \nthe apartm...   
5    Kryshana is really nice and always answered ou...   
6    We loved staying here.  Kryshana and Michael w...   
7    Kryshana and Mike were very cool, friendly, ni...   
8                Very accommodating and sweet couple.    
9    Michael was welcoming - a good communication a...   
10   Complessivamente il soggiorno è andato bene. K...   
11   Kryshana and Mike were very accommodating host...   
12   Amazing!! The emplacement is super convenient,...   
13   Great spot! The apartment has everything you n...   
14   Only issue was getting there initially, exact ...   
15   Safe area and Conformable accommodations. Abou...   
16   Overall I