# CS6109 - Compiler Design Project

**Project Topic: Social Recommendations with multiple influence from Direct User Interactions**

Done by:

    Vivek Ramkumar, 2018103082
    Kariketi Tharun Reddy, 201803034
    G. R. Srikanth, 2018103603
                                                    
    Third Years studying B.E, Computer Science, at College of Engineering, Guindy
    Chennai, Tamil Nadu, India

# Importing Standard Libraries:

In [1]:
import numpy as np
import pandas as pd 
from collections import OrderedDict
import gc
import random

import os

# Obtaining Data:

We are using three datasets of products, from which recommendations will be made.

1. Groceries Dataset
2. Flipkart Product Database
3. Consumer Reviews of Amazon Products

While extracting data from the respective databases, we need to ensure there are no duplicate items or NaN values.

So, we use OrderedDict.fromkeys() to remove duplicate items while preserving their original order. We then drop any NaN values from the list of products.

Finally, we delete the DataFrame, as it occupies a lot of RAM.
gc.collect() then runs the garbage collector to free any memory still held after the deletion. The number printed below the cell is its return value: the count of unreachable objects it found.

This process is repeated for all three datasets.
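As a minimal sketch of the de-duplication and NaN filtering described above, the snippet below applies the same two steps to a small hand-made list. The sample values and the names `raw_items`, `unique_items`, and `clean_items` are illustrative assumptions, not data from the actual datasets.

```python
from collections import OrderedDict

# Hypothetical raw column values, including a duplicate and a missing value
raw_items = ["whole milk", "yogurt", float("nan"), "whole milk", "rolls/buns"]

# fromkeys() keeps the first occurrence of each key, in insertion order
unique_items = list(OrderedDict.fromkeys(raw_items))

# NaN is a float whose string form is 'nan', so this filter drops it
clean_items = [item for item in unique_items if str(item) != "nan"]

print(clean_items)  # ['whole milk', 'yogurt', 'rolls/buns']
```

Note that the `str(item) != 'nan'` test would also drop a product literally named "nan"; for these datasets that is assumed not to matter.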

In [2]:
groceries_db = pd.read_csv('groceries - groceries.csv')

new_groceries_db = groceries_db.drop(columns = ['Item(s)'])

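# Flatten every remaining column into one list of item names
temp = []

for i in range(len(new_groceries_db.columns)):
    for x in new_groceries_db.iloc[:, i]:
        temp.append(x)

# Remove duplicates (keeping first-seen order), then drop NaN entries
groceries = list(OrderedDict.fromkeys(temp))

groceries = [item for item in groceries if str(item) != 'nan']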

del groceries_db, new_groceries_db
gc.collect()

0

In [3]:
flipkart_db = pd.read_csv('flipkart_com-ecommerce_sample.csv')

temp = flipkart_db['product_name']

# Remove duplicates (keeping first-seen order), then drop NaN entries
flipkart_items = list(OrderedDict.fromkeys(temp))

flipkart_items = [item for item in flipkart_items if str(item) != 'nan']

del flipkart_db
gc.collect()

0

In [4]:
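def read_amzn_prod_dbs(file_path):
    
    # Read one Amazon reviews CSV and return its product-name column
    temp_db = pd.read_csv(file_path)

    temp = temp_db['name']
    
    return temp
    
def append_list_to_list(orig_list, new_list):
    
    # Append every item of orig_list (the source) to new_list (the destination), in place
    for x in orig_list:
        new_list.append(x)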

temp_1 = read_amzn_prod_dbs('1429_1.csv')
temp_2 = read_amzn_prod_dbs('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products.csv')
temp_3 = read_amzn_prod_dbs('Datafiniti_Amazon_Consumer_Reviews_of_Amazon_Products_May19.csv')

temp = []

append_list_to_list(temp_1, temp)
append_list_to_list(temp_2, temp)
append_list_to_list(temp_3, temp)

# Remove duplicates (keeping first-seen order), then drop NaN entries
amazon_items = list(OrderedDict.fromkeys(temp))

amazon_items = [item for item in amazon_items if str(item) != 'nan']



We combine all the obtained lists of products into a single mega-list.
    
All recommendations made to the users are drawn from this list.
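The combining step can also be sketched with the built-in `list.extend`, which does the same job as the `append_list_to_list` helper. The three sample lists below are made-up stand-ins for the real per-dataset lists, and the final de-duplication pass is an optional extra that the notebook itself does not perform.

```python
from collections import OrderedDict

# Hypothetical stand-ins for flipkart_items, groceries and amazon_items
flipkart_sample = ["Analog Watch", "Cotton Kurta"]
groceries_sample = ["whole milk"]
amazon_sample = ["Fire Tablet", "Analog Watch"]

combined = []
for source in (flipkart_sample, groceries_sample, amazon_sample):
    # extend() appends every element of `source` in place
    combined.extend(source)

# Optional: remove names that appear in more than one dataset
combined_unique = list(OrderedDict.fromkeys(combined))

print(len(combined), len(combined_unique))  # 5 4
```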

In [5]:
complete_list = []

append_list_to_list(flipkart_items, complete_list)
append_list_to_list(groceries, complete_list)
append_list_to_list(amazon_items, complete_list)

In [6]:
print("The total number of products available to recommend :", len(complete_list))

The total number of products available to recommend : 12970


**The mega-list contains over 12,000 product names. This is a sizeable amount, enough to get different recommendations on every run.**

# Lexical Analysis:

The main focus of lexical analysis is to obtain tokens from the given text input.

We need to differentiate between entities (users, organizations, religion, etc.) and interests (garments, tech, etc.).

The technique we employ is part-of-speech (POS) tagging.
    
POS tagging algorithms assign a tag such as "NOUN", "VERB", or "ADJ" to each token, based on the word itself and its context in the sentence.

We will be using the spaCy library, which provides pre-trained NLP models. Entities such as user names come from the same model's named entity recognizer, which labels spans of text with types such as "PERSON".

The input is parsed sentence by sentence, from which nouns, verbs, users, and user types are obtained. The respective information is summarized in a pandas dataframe. You will see it in the 'Input and Output' section.

In [7]:
import nltk
import spacy

from spacy import displacy

from nltk.corpus import stopwords

nltk.download('stopwords')

from nltk.tokenize import word_tokenize

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\crack\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [8]:
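def tokenizer(text, nouns, verbs, users, user_types):
    
    # Note: the model is reloaded on every call; loading it once outside is faster
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    
    # Collect nouns and verbs by their part-of-speech tag
    for token in doc:
        if token.pos_ == 'NOUN':
            nouns.append(str(token))

        if token.pos_ == 'VERB':
            verbs.append(str(token))

    # Named entities (e.g. PERSON) give us the users and their types
    for entity in doc.ents:
        users.append(entity.text)
        user_types.append(entity.label_)
    
    # Filter stopwords in place so the caller's list is updated
    # (plain assignment would only rebind the local name).
    # The input is English, so only the English stopword list is used.
    stop_words = set(stopwords.words('english'))
    nouns[:] = [word for word in nouns if word.lower() not in stop_words]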

Keep in mind that the parser is designed for short conversations, especially one-liners.

We split the input text on the newline character ('\n') using the splitlines() function, and treat each resulting line as one dialogue.

If a dialogue spans two lines, the parser will be confused: it treats the extra line as a separate dialogue with its own row in the summary.

This design reflects social messaging applications, where users frequently communicate in short messages.
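The one-line-per-dialogue assumption can be illustrated with a small sketch. The helper `split_speaker` is hypothetical and is not part of the notebook, which relies on spaCy's entity recognition to find the users; it is used here only to show what happens when a message wraps onto a second line.

```python
text = """Rahul : I like to play on my tablet.
Gokul : My dog does not enjoy
using dog shampoo."""

# Two people spoke, but splitlines() yields three separate "dialogues"
lines = text.splitlines()


def split_speaker(line):
    """Hypothetical helper: split 'Name : message' into (name, message)."""
    name, sep, message = line.partition(":")
    if not sep:
        # No colon found, so the line has no identifiable speaker
        return None, line.strip()
    return name.strip(), message.strip()


parsed = [split_speaker(line) for line in lines]

for speaker, message in parsed:
    print(speaker, "|", message)
```

The third entry has no speaker, which is the misalignment the text above warns about.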

In [9]:
def get_tokens(text):
    
    conversation = text.splitlines()

    # Collect one row per dialogue, then build the DataFrame once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    for dialogue in conversation:
    
        nouns = []
        verbs = []
        users = []
        user_types = []

        tokenizer(dialogue, nouns, verbs, users, user_types)
    
        # Remove repeated nouns and verbs, keeping first-seen order
        nouns = list(OrderedDict.fromkeys(nouns))
        verbs = list(OrderedDict.fromkeys(verbs))
    
        rows.append({'Users': users, 'Nouns': nouns, 'Verbs': verbs, 'User_Types': user_types})
    
    token_summary = pd.DataFrame(rows, columns=['Users', 'Nouns', 'Verbs', 'User_Types'])
    
    return conversation, token_summary

# Syntax Analysis:

The purpose of analyzing the syntax of a sentence is to take into account the order in which its tokens appear.

For example, seeing a "not" before a verb, say "enjoying", can imply a negative sentiment.

Sentiment analysis is an important tool in recommendation: it helps recommend products to those who want them.

People who express negative sentiment towards a product will not receive recommendations for it.
    
Instead, they will get recommendations based on what their friends like.
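To make the "not before a verb" idea concrete, here is a deliberately simple sketch. It is a toy, not the method used in this notebook: the actual scoring is delegated to NLTK's VADER analyzer, which has its own negation handling. The word list `NEGATIONS` and the function `is_negated` are assumptions made for illustration.

```python
# Hypothetical, minimal list of negation words
NEGATIONS = {"not", "no", "never"}


def is_negated(tokens, verb):
    """Return True if a negation word appears directly before `verb`."""
    lowered = [token.lower() for token in tokens]
    verb = verb.lower()
    if verb not in lowered:
        return False
    position = lowered.index(verb)
    return position > 0 and lowered[position - 1] in NEGATIONS


print(is_negated("My dog does not enjoy using dog shampoo".split(), "enjoy"))  # True
print(is_negated("I love to wear a kurta".split(), "love"))  # False
```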

In [10]:
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

def get_sentiments(conversation):

    sid = SentimentIntensityAnalyzer()

    # Collect one row per dialogue, then build the DataFrame once at the end
    # (DataFrame.append was removed in pandas 2.0)
    rows = []

    for dialogue in conversation:
    
        ss = sid.polarity_scores(dialogue)
    
        rows.append({'Dialogue': dialogue, 'Compound': ss['compound'], 'Negative': ss['neg'], 'Neutral': ss['neu'], 'Positive': ss['pos']})
        
    sentiments = pd.DataFrame(rows, columns=['Dialogue', 'Compound', 'Negative', 'Neutral', 'Positive'])

    return sentiments

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\crack\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


The NLTK VADER Sentiment Analyzer is a convenient tool for quickly analyzing the sentiment of a sentence.

We obtain the polarity scores in each sentiment category, i.e., "Positive", "Negative", "Neutral", and "Compound", for each dialogue. The compound score is a normalized overall score between -1 (most negative) and +1 (most positive).

A summary of the sentiments expressed by each user will be shown in the 'Input and Output' section.

In [11]:
def display_dependency_tree(conversation):

    nlp = spacy.load("en_core_web_sm")

    for dialogue in conversation:
    
        doc = nlp(dialogue)

        displacy.render(doc, style="dep")

The spaCy library offers one more feature: the ability to display dependency trees.

Dependency trees show how the words of a sentence relate to one another, which helps in understanding how each sentence was parsed before its sentiment is computed.

You can see the respective dependency trees for each user in the "Input and Output" section.

# Recommendation (Semantic):

This is the recommendation section.

We will check the list of nouns from our summary of tokens to see what the interests of the users are.

If a noun appears anywhere in a product name (a case-insensitive substring match), we store that product name in a new list.

This new list is considered as our initial set of recommendations.
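The matching rule can be sketched on a tiny catalogue. The product names and the function `match_products` below are illustrative assumptions; the logic mirrors the case-insensitive substring test described above.

```python
# Hypothetical mini-catalogue standing in for the full product list
catalog = [
    "Analog Watch - For Women",
    "Tablet Keyboard",
    "Watchband Strap",
    "Cricket Ball",
]


def match_products(nouns, products):
    """Return every product whose name contains one of the nouns."""
    matches = []
    for noun in nouns:
        for product in products:
            # Case-insensitive substring test
            if noun.lower() in product.lower():
                matches.append(product)
    return matches


print(match_products(["watch"], catalog))
# ['Analog Watch - For Women', 'Watchband Strap']
```

Because this is a substring test rather than a whole-word test, "watch" also matches "Watchband Strap". This is one source of the occasional outliers in the final recommendations.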

In [12]:
def create_init_rec(token_summary):

    rec_1 = []

    for i in range(len(token_summary['Nouns'])):
    
        sublist = []
    
        for x in token_summary['Nouns'][i]:
        
            for y in complete_list:
            
                if x.lower() in y.lower():
                    sublist.append(y)
        
        rec_1.append(sublist)
        
    return rec_1

The initial recommendation lists are quite large.

We will pick five products at random to recommend to each user.

Of course, we do not want the same product to be picked twice in one selection.

This is easily achieved with random.sample(population, k), where k is the number of items to select. The function samples without replacement, so no entry of the list is selected more than once. It raises a ValueError if k is larger than the list.
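A short sketch of sampling without replacement, including the case where a user has fewer than five matching products. The helper name `pick_up_to` and the sample product names are assumptions made for illustration.

```python
import random


def pick_up_to(items, k=5):
    """Sample at most k items without replacement."""
    # Capping k avoids the ValueError raised when k > len(items)
    return random.sample(items, min(k, len(items)))


candidates = ["Product {}".format(n) for n in range(1, 11)]

picked = pick_up_to(candidates)
print(picked)  # five distinct products, different on each run

few = pick_up_to(["Kurta A", "Kurta B"])
print(few)  # both products, in random order
```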

In [13]:
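def get_actual_rec(rec_1):

    recommendations = []

    for i in range(len(rec_1)):
    
        # random.sample raises ValueError if asked for more items than exist,
        # so cap the sample size at the number of matches
        sublist = random.sample(rec_1[i], min(5, len(rec_1[i])))
    
        recommendations.append(sublist)
        
    return recommendations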

In [14]:
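def mixed_rec(rec, index):
    
    # Build a mixed list for the user at `index`: one random product per user,
    # where the user's own slot is filled from a neighbour's list instead
    temp = []
    
    for i in range(len(rec)):
        
        if i == index:
            # Own slot: borrow from the next user, or from the previous one for the last user
            if i == len(rec) - 1:
                temp.append(random.choice(rec[i-1]))
            else:
                temp.append(random.choice(rec[i+1]))
        else:
            temp.append(random.choice(rec[i]))
        
    return temp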

As stated before, the sentiment of the user is considered as a factor in the recommendation process.

The compound polarity score is considered here.

If a user's sentiment is negative (compound score below zero), we pick random recommendations from the lists built for the other users, i.e., based on the interests of that user's friends.

Otherwise, if it is positive or neutral, we can directly recommend based on their individual interest.
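The decision rule can be summarized in a few lines. This is a simplified sketch under stated assumptions: the function `choose_recommendations` and its arguments are hypothetical, and unlike the notebook's `mixed_rec` it draws only from the friends' lists and does not add an extra pick from a neighbouring user.

```python
import random


def choose_recommendations(compound, own, friends_lists):
    """Pick a user's recommendations from the compound sentiment score."""
    if compound < 0:
        # Negative sentiment: one random product from each friend's list
        return [random.choice(products) for products in friends_lists if products]
    # Positive or neutral sentiment: keep the user's own recommendations
    return own


own = ["Dog Shampoo 200 ml"]
friends = [["Tablet Keyboard", "Tablet Stand"], ["Straight Kurta"]]

print(choose_recommendations(0.5994, own, friends))   # ['Dog Shampoo 200 ml']
print(choose_recommendations(-0.3875, own, friends))  # one tablet item and 'Straight Kurta'
```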

In [15]:
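def print_rec():

    print("Recommendations are: \n")

    # Row i of token_summary, sentiments and recommendations all describe dialogue i,
    # so index by dialogue rather than counting individual users
    for i, users in enumerate(token_summary['Users']):
            
        for x in users:
            
            if sentiments['Compound'][i] < 0:
                # Negative sentiment: recommend from the other users' interests
                print(x, '->', mixed_rec(recommendations, i))
            else:
                # Positive or neutral sentiment: recommend from the user's own interests
                print(x, '->', recommendations[i])
        
            print()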

# Input and Output

**This is the input text.**

**Every time you want to get some new recommendations, tweak this and run the cells below it.**

In [16]:
text = """ Rahul : I like to play on my tablet.
 Gokul : My dog does not enjoy using dog shampoo.
 Ankita : I love to wear a kurta.
 Badrinath : I frequently wear a shirt.
 Vijay : I am searching for a nice watch. 
 Srikanth : I like to play cricket."""

In [17]:
conversation, token_summary = get_tokens(text)

In [18]:
token_summary

| | Users | Nouns | Verbs | User_Types |
|---|---|---|---|---|
| 0 | [Rahul] | [tablet] | [like, play] | [PERSON] |
| 1 | [Gokul] | [dog, shampoo] | [enjoy, using] | [PERSON] |
| 2 | [Ankita] | [kurta] | [love, wear] | [PERSON] |
| 3 | [Badrinath] | [shirt] | [wear] | [PERSON] |
| 4 | [Vijay] | [watch] | [searching] | [PERSON] |
| 5 | [Srikanth] | [cricket] | [like, play] | [PERSON] |


In [19]:
sentiments = get_sentiments(conversation)

In [20]:
sentiments

| | Dialogue | Compound | Negative | Neutral | Positive |
|---|---|---|---|---|---|
| 0 | Rahul : I like to play on my tablet. | 0.5994 | 0.0 | 0.505 | 0.495 |
| 1 | Gokul : My dog does not enjoy using dog shampoo. | -0.3875 | 0.247 | 0.753 | 0.0 |
| 2 | Ankita : I love to wear a kurta. | 0.6369 | 0.0 | 0.488 | 0.512 |
| 3 | Badrinath : I frequently wear a shirt. | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | Vijay : I am searching for a nice watch. | 0.4215 | 0.0 | 0.641 | 0.359 |
| 5 | Srikanth : I like to play cricket. | 0.5994 | 0.0 | 0.38 | 0.62 |


Shown below are the dependency trees for each sentence.

The arrows indicate dependencies between words. In displaCy's rendering, the arrowhead sits on the dependent word: if an arrow points from 'a' to 'b', then 'b' depends on 'a', i.e., 'a' is the head of 'b'.

This is useful in seeing how the tokens are tagged, and how the sentence is syntactically analyzed.

In [21]:
display_dependency_tree(conversation)

This is the initial list of recommendations.

We will be picking random items from it, based on each user's interests, to recommend to that user.

In [22]:
rec_1 = create_init_rec(token_summary)
rec_1

[['cm key 354 Wired USB Tablet Keyboard',
  'Shortkut enterprises Model no 400 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 456 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 428 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 483 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 476 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 477 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 497 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 460 Mobile/Tablet Speaker',
  'Shortkut enterprises Model no 434 Mobile/Tablet Speaker',
  'couponsmall key 343 Wired USB Tablet Keyboard',
  "Orientel Universal 360' Rotation Rubber Suction Cup Car Mount Stand Holder ABS Colorful For Tablet PC & Mobile Phone",
  'i-Static Universal Car Tablet Holder',
  'Autosun OK Stand Smartphone & Tablet Stand',
  'Smartpro 10.5v,2.9a for Tablet Charger 75 Adapter',
  'Generix pack of 5 Micro USB On-the-go For Mobile Phones & Tablets OTG Cable',
  'Generix

This is the final recommendation.

You can see the users' names, each followed by a list of recommendations based on their interests.
    
Keep in mind that for the most part the recommendations are accurate, but you may chance upon some outliers, due to the random nature of product selection.

In [23]:
recommendations = get_actual_rec(rec_1)
print_rec()

Recommendations are: 

Rahul -> ['Amazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,\r\nAmazon 9W PowerFast Official OEM USB Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,', 'Shortkut enterprises Model no 483 Mobile/Tablet Speaker', 'Amazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,\r\nAmazon 5W USB Official OEM Charger and Power Adapter for Fire Tablets and Kindle eReaders,,,', 'Shortkut enterprises Model no 460 Mobile/Tablet Speaker', 'Fire Kids Edition Tablet, 7 Display, Wi-Fi, 16 GB, Blue Kid-Proof Case']

Gokul -> ['Shortkut enterprises Model no 460 Mobile/Tablet Speaker', 'stylishvilla Embroidered Kurta & Churidar', "FreeHand Solid Men's Straight Kurta", "I-Voc Men's Printed Casual Shirt", 'HMT OLSS 01 Analog Watch  - For Women', 'SM Sway Cricket Ball -   Size: 5,  Diameter: 2.5 cm']

Ankita -> ["FreeHand Solid Men's Straight Kurta", "Jaipurkurti Striped Women's Strai

**You have reached the end of this notebook. Thank you for reading!**