In [456]:
!pip install wordcloud



In [457]:
import pandas as pd
import numpy as np

import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
from nltk.tokenize import word_tokenize
from nltk.tokenize import sent_tokenize
nltk.download('punkt')
nltk.downloader.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from collections import Counter
from wordcloud import WordCloud

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/tanmaysharma/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/tanmaysharma/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     /Users/tanmaysharma/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /Users/tanmaysharma/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Reading Dataset

In [458]:
df = pd.read_json('Downloads/results.json')
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,URL,Rating,ReviewDetails
0,Productive,"Good company, cool workplace, work load little...",https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Current Employee) - Ghansoli - August 30,..."
1,Stressful,1. Need to work on boss's whims and fancies 2....,https://in.indeed.com/cmp/Reliance-Industries-...,3,"(Former Employee) - - August 26, 2021"
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Former Employee) - - August 17, 2021"
3,Productive,I am just pass out bsc in chemistry Typical da...,https://in.indeed.com/cmp/Reliance-Industries-...,5,"(Current Employee) - - August 17, 2021"
4,Non productive,Not so fun at work just blame games Target pe...,https://in.indeed.com/cmp/Reliance-Industries-...,1,"(Former Employee) - - August 9, 2021"


## Feature Engineering

1. The Review Details column packs three things into one string: the reviewer's employment status, their location, and the date of the review. We will split these into separate columns and then delete the Review Details column.

2. The URL contains the name of the company being reviewed. We will extract the company from the URL and then delete the URL column as well.

In [459]:
def extract_position(data):
    return data.split(" - ")[0]

In [460]:
def extract_place(data):
    return data.split(" - ")[1]

In [461]:
def extract_date(data):
    return data.split(" - ")[2]
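
The three helpers above each split the same string and will raise an IndexError if a field is missing. A minimal sketch of doing the split once and padding absent fields, assuming the "(status) - place - date" layout seen in df.head(). The function name parse_review_details and the sample string are made up for illustration and are not part of the notebook:

```python
def parse_review_details(details, sep=" - "):
    """Split a ReviewDetails-style string into (position, place, date).

    Assumes the layout "(status) - place - date"; fields that are
    absent come back as empty strings instead of raising IndexError.
    """
    parts = details.split(sep)
    parts += [""] * (3 - len(parts))  # pad when fewer than three fields exist
    position, place, date = (part.strip() for part in parts[:3])
    # drop the surrounding parentheses, as the notebook does later with str.strip
    return position.strip("( )"), place, date


# made-up sample in the style of the ReviewDetails column
print(parse_review_details("(Current Employee) - Ghansoli - August 30, 2021"))
```

With pandas, the returned tuples could be expanded into three columns in a single apply call, but the separate helpers above give the same result on well-formed rows.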

#### Place Column

In [462]:
df['Place'] = df['ReviewDetails'].apply(extract_place)

In [463]:
df['Place'][df['Place'] == ' '].count().sum()

129942

Many of the places are missing: 129,942 rows have a blank location. We can assume that the people who wrote these reviews chose not to give their location. Let's see what the remaining location values look like.

In [464]:
df['Place'].value_counts()

                                      129942
 India                                  3319
 Bangalore Urban, Karnataka              989
 india                                   767
 HP                                      275
                                       ...  
 Opp to VOC ground, Palayamkottai          1
 Janakpuri Block B 1, Delhi                1
 Sion, Mumbai, Maharashtra                 1
 pondycheery                               1
 Nerul, India                              1
Name: Place, Length: 3781, dtype: int64

The location data is very unstructured. Some values are highly specific addresses, while others are just the country. For example, Bangalore Urban, Karnataka has 989 reviews, India has 3319, and india (lowercase) has another 767, all counted separately. This lack of structure makes the column hard to work with and of little use in the long term, so we remove it.
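
To make the casing and whitespace problem in the value_counts() output concrete, here is a small sketch of how such values could be normalized before counting. The notebook does not do this and simply drops the column; normalize_place and the sample list are hypothetical:

```python
from collections import Counter

# made-up sample in the style of the value_counts() output
raw_places = [" India", " india", " INDIA ", " Bangalore Urban, Karnataka", " "]


def normalize_place(place):
    """Trim whitespace and ignore case; blank strings become None."""
    cleaned = place.strip().casefold()
    return cleaned or None


counts = Counter(normalize_place(p) for p in raw_places)
print(counts.most_common())
```

Even after this step, free-text addresses such as street names would remain unique, which supports the decision to drop the column.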

In [465]:
df.drop(columns='Place', inplace=True)

#### Position Column

In [466]:
df['Position'] = df['ReviewDetails'].apply(extract_position)

Let's remove the parentheses from the Position column.

In [467]:
df['Position'] = df['Position'].str.strip('( )')

In [468]:
df['Position']

0         Current Employee
1          Former Employee
2          Former Employee
3         Current Employee
4          Former Employee
                ...       
145204     Former Employee
145205     Former Employee
145206     Former Employee
145207     Former Employee
145208     Former Employee
Name: Position, Length: 145209, dtype: object

#### Date Column

In [469]:
df['Date'] = df['ReviewDetails'].apply(extract_date)

#### Removing Review Details

In [470]:
df.drop(columns='ReviewDetails', inplace=True)

#### Extracting Company Names

In [471]:
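def extract_company(s):
    # drop the fixed 26-character prefix 'https://in.indeed.com/cmp/'
    s = s[26:]
    # keep only the company slug that precedes the next '/'
    s = s.split('/')[0]
    return s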

In [472]:
df['Company'] = df['URL'].apply(extract_company)
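
Slicing at a fixed offset works only while every URL starts with exactly the same prefix. A minimal sketch of a more defensive variant that parses the URL and takes the path segment after cmp. The function name extract_company_from_url and the /reviews suffix in the sample URL are assumptions for illustration:

```python
from urllib.parse import urlparse


def extract_company_from_url(url):
    """Return the path segment that follows 'cmp', or None if there is none."""
    segments = [seg for seg in urlparse(url).path.split("/") if seg]
    if "cmp" in segments:
        idx = segments.index("cmp")
        if idx + 1 < len(segments):
            return segments[idx + 1]
    return None


# hypothetical URL in the style of the dataset; the '/reviews' suffix is assumed
print(extract_company_from_url("https://in.indeed.com/cmp/Reliance-Industries-Ltd/reviews"))
```

This variant does not depend on the host name or its length, so a URL from a different Indeed domain would still yield the company slug.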

In [473]:
df['Company'].value_counts()

Tata-Consultancy-Services-(tcs)    14441
IBM                                10820
Infosys                            10696
Accenture                          10137
Cognizant-Technology-Solutions      9626
Hdfc-Bank                           6749
Capgemini                           5248
Amazon.com                          3385
L&T-Technology-Services-Ltd.        3226
Concentrix                          3162
Axis-Bank                           3125
HSBC                                3117
HP                                  3115
Hinduja-Global-Solutions            3060
Dell-Technologies                   2778
Deloitte                            2671
Bharti-Airtel-Limited               2353
Mphasis                             2269
Teleperformance                     2175
Oracle                              2076
Ey                                  1983
Kotak-Mahindra-Bank                 1910
Sutherland                          1896
Ericsson                            1852
Wns-Global-Servi

We drop the URL column now.

In [474]:
df.drop(columns='URL', inplace=True)

In [475]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,Rating,Position,Date,Company
0,Productive,"Good company, cool workplace, work load little...",3,Current Employee,"August 30, 2021",Reliance-Industries-Ltd
1,Stressful,1. Need to work on boss's whims and fancies 2....,3,Former Employee,"August 26, 2021",Reliance-Industries-Ltd
2,Good Company for Every employee,"Good company for every Engineers dream, Full M...",5,Former Employee,"August 17, 2021",Reliance-Industries-Ltd
3,Productive,I am just pass out bsc in chemistry Typical da...,5,Current Employee,"August 17, 2021",Reliance-Industries-Ltd
4,Non productive,Not so fun at work just blame games Target pe...,1,Former Employee,"August 9, 2021",Reliance-Industries-Ltd


## Missing Values

In [476]:
df.isna().sum()

ReviewTitle       0
CompleteReview    0
Rating            0
Position          0
Date              0
Company           0
dtype: int64

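isna() reports no missing values in any column. Note that blank strings, such as the empty locations seen earlier, are not counted as missing by isna().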

## Data Preprocessing

1. We need to process the complete review and convert it into a form that is easier to analyze.
2. We will reduce position to a binary category (current or former employee).
3. We will check the review title column and alter it if necessary.

#### Complete Review

In [477]:
# we will first create a copy column for reference later on

df['CompleteReview_ref'] = df['CompleteReview']

In [478]:
# removing HTML tags from reviews

def remove_tags(text):
  # non-greedy match of anything between '<' and '>'
  remove = re.compile(r'<.*?>')
  return re.sub(remove, '', text)
  
df['CompleteReview'] = df['CompleteReview'].apply(remove_tags)

In [479]:
# removing special characters from reviews

def special_char(text):
  sample_text = ''
  for x in text:
    if x.isalnum():
      sample_text = sample_text + x
    else:
      sample_text = sample_text + ' '
  return sample_text

df['CompleteReview'] = df['CompleteReview'].apply(special_char)
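
Building the result with repeated string concatenation can be slow on long reviews. A minimal sketch of an equivalent version that uses a generator and a single join. The names special_char_loop and special_char_join are made up here so that both variants can be compared side by side:

```python
def special_char_loop(text):
    # same logic as the notebook's special_char
    out = ''
    for ch in text:
        out = out + (ch if ch.isalnum() else ' ')
    return out


def special_char_join(text):
    # one pass over the characters, one final join
    return ''.join(ch if ch.isalnum() else ' ' for ch in text)


sample = "Boss's whims & fancies!"
print(repr(special_char_join(sample)))
```

Both versions replace the apostrophe in "Boss's" with a space, which is why a stray "s" token can appear before stop-word removal.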

In [480]:
# converting the text into lowercase

def convert_lower(text):
   return text.lower()
  
df['CompleteReview'] = df['CompleteReview'].apply(convert_lower)

In [481]:
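# build the stop-word set once instead of on every call
stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # keep only the words that are not English stop words
    filtered_text = [word for word in text.split() if word.lower() not in stop_words]
    return ' '.join(filtered_text)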

df['CompleteReview'] = df['CompleteReview'].apply(remove_stopwords)

In [482]:
# performing lemmatization

def lemmatize_word(text):
    wordnet = WordNetLemmatizer()
    words = nltk.word_tokenize(text)
    lemmatized_words = [wordnet.lemmatize(word) for word in words]
    lemmatized_text = ' '.join(lemmatized_words)
    return lemmatized_text

df['CompleteReview'] = df['CompleteReview'].apply(lemmatize_word)

In [483]:
# perform tokenization

def tokenize_word(text):
    return word_tokenize(text)

df['CompleteReview'] = df['CompleteReview'].apply(tokenize_word)

In [484]:
df['CompleteReview'].head()

0    [good, company, cool, workplace, work, load, l...
1    [1, need, work, bos, whim, fancy, 2, priority,...
2    [good, company, every, engineer, dream, full, ...
3    [pas, bsc, chemistry, typical, day, work, mang...
4    [fun, work, blame, game, target, people, le, t...
Name: CompleteReview, dtype: object

In [485]:
df['CompleteReview'].tail()

145204    [get, lot, learn, company, systematic, follows...
145205    [lot, scope, learn, different, technology, use...
145206    [overall, positive, experience, nice, environm...
145207    [happy, started, career, pretigeous, group, ev...
145208    [got, good, experience, knowledge, work, credi...
Name: CompleteReview, dtype: object

#### Position Data

In [486]:
df['Position'].value_counts()

Former Employee                                                    79193
Current Employee                                                   65493
Officer   (Former Employee                                            93
Officer   (Current Employee                                           74
Employee   (Current Employee                                          19
                                                                   ...  
Training   (Current Employee                                           1
customer satisfaction offcier, credit cards    (Former Employee        1
marketing   (Former Employee                                           1
https://www.indeed.co.in/   (Former Employee                           1
PROGRAMER .........   (Former Employee                                 1
Name: Position, Length: 181, dtype: int64

I want every reviewer tagged as either a former employee or a current employee, regardless of any job title attached to the status. This turns Position into a binary category.

In [487]:
df.loc[df['Position'].str.contains('Current Employee'), 'Position'] = 'Current Employee'

In [488]:
df.loc[df['Position'].str.contains('Former Employee'), 'Position'] = 'Former Employee'

In [489]:
df['Position'].value_counts()

Former Employee     79461
Current Employee    65748
Name: Position, dtype: int64

#### Review Title Data

In [490]:
# we will first create a copy column for later reference

df['ReviewTitle_ref'] = df['ReviewTitle']

In [491]:
df['ReviewTitle'].str.len().max()

150

In [492]:
df['ReviewTitle'].str.len().min()

1

In [493]:
df['ReviewTitle'].str.len().mean()

27.678890426901912

In [494]:
df['ReviewTitle'].str.len().std()

20.368275710963566

The average title is about 28 characters long with a standard deviation of about 20, so titles carry enough text to be worth analyzing. We therefore preprocess the review titles the same way we did the complete reviews. We did not do this earlier because we did not yet know these figures.

In [495]:
df['ReviewTitle'] = df['ReviewTitle'].apply(remove_tags)
df['ReviewTitle'] = df['ReviewTitle'].apply(special_char)
df['ReviewTitle'] = df['ReviewTitle'].apply(convert_lower)
df['ReviewTitle'] = df['ReviewTitle'].apply(remove_stopwords)
df['ReviewTitle'] = df['ReviewTitle'].apply(lemmatize_word)
df['ReviewTitle'] = df['ReviewTitle'].apply(tokenize_word)

In [496]:
df['ReviewTitle'].head()

0                        [productive]
1                         [stressful]
2    [good, company, every, employee]
3                        [productive]
4                   [non, productive]
Name: ReviewTitle, dtype: object

In [497]:
df['ReviewTitle'].tail()

145204       [definitely, good, place, work, lot, learning]
145205        [service, company, great, scope, improvement]
145206    [productive, fun, work, great, place, certific...
145207                        [great, place, start, career]
145208                                  [nice, place, work]
Name: ReviewTitle, dtype: object

Good, the data preparation is done. Now we can start the actual analysis.

## Sentiment Analysis using Sentiment Intensity Analyzer

Now we will assign a sentiment score to each review in order to assess the kind of culture that prevails at each company.

We will use the VADER (Valence Aware Dictionary and sEntiment Reasoner) intensity analyzer to score the text.

In [498]:
# build the analyzer once; constructing it for every row is slow
analyzer = SentimentIntensityAnalyzer()

def get_overall_sentiment(tokens):
    # an empty token list would otherwise cause a division by zero
    if not tokens:
        return 0.0
    # score each token on its own and average the compound scores
    compound_scores = [analyzer.polarity_scores(token)['compound'] for token in tokens]
    return sum(compound_scores) / len(compound_scores)

In [499]:
df['sentiment_score'] = (df['ReviewTitle'] + df['CompleteReview']).apply(get_overall_sentiment)
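
Because get_overall_sentiment scores every token separately and then averages, neutral tokens pull the result toward zero. The sketch below illustrates this with a toy lexicon. It assumes the normalization x / sqrt(x^2 + 15) that, as far as we know, VADER applies to produce its compound score; the lexicon values and the names TOY_LEXICON, compound, and token_average are illustrative and not read from VADER itself:

```python
import math

# toy valence lexicon; values are illustrative, not loaded from VADER's lexicon file
TOY_LEXICON = {"good": 1.9, "stressful": -1.5}


def compound(valence, alpha=15):
    """Squash a raw valence into the open interval (-1, 1)."""
    return valence / math.sqrt(valence * valence + alpha)


def token_average(tokens):
    """Mean of per-token compound scores, mirroring get_overall_sentiment."""
    scores = [compound(TOY_LEXICON.get(tok, 0.0)) for tok in tokens]
    return sum(scores) / len(scores) if scores else 0.0


tokens = ["good", "company", "workplace", "work", "load"]
print(round(compound(1.9), 4))          # score of the single word "good"
print(round(token_average(tokens), 4))  # the same word diluted by four neutral tokens
```

Under these assumptions one positive word among five tokens yields roughly a fifth of its own score, so per-token averages of a few hundredths can still reflect clearly positive wording.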

Let's compare the first 5 titles with their scores to see whether the scores look plausible.

In [500]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,Rating,Position,Date,Company,CompleteReview_ref,ReviewTitle_ref,sentiment_score
0,[productive],"[good, company, cool, workplace, work, load, l...",3,Current Employee,"August 30, 2021",Reliance-Industries-Ltd,"Good company, cool workplace, work load little...",Productive,0.085186
1,[stressful],"[1, need, work, bos, whim, fancy, 2, priority,...",3,Former Employee,"August 26, 2021",Reliance-Industries-Ltd,1. Need to work on boss's whims and fancies 2....,Stressful,-0.024795
2,"[good, company, every, employee]","[good, company, every, engineer, dream, full, ...",5,Former Employee,"August 17, 2021",Reliance-Industries-Ltd,"Good company for every Engineers dream, Full M...",Good Company for Every employee,0.082558
3,[productive],"[pas, bsc, chemistry, typical, day, work, mang...",5,Current Employee,"August 17, 2021",Reliance-Industries-Ltd,I am just pass out bsc in chemistry Typical da...,Productive,0.051812
4,"[non, productive]","[fun, work, blame, game, target, people, le, t...",1,Former Employee,"August 9, 2021",Reliance-Industries-Ltd,Not so fun at work just blame games Target pe...,Non productive,-0.024957


## Overall Score of the Company

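Since each review also has a rating, we will create a new column called Overall_Score by multiplying the absolute value of sentiment_score by the rating. Note that taking the absolute value discards the sign, so a strongly negative review and a strongly positive review with the same rating get the same Overall_Score.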

In [501]:
df['Overall_Score'] = abs(df['sentiment_score'])*df['Rating']
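
A short worked example of the formula above, using sentiment values and ratings from the first rows of the data. The second function, blended_score, is a hypothetical alternative that is not used in this notebook: it assumes ratings run from 1 to 5, rescales them to [-1, 1], and averages them with the signed sentiment score:

```python
def overall_score(sentiment, rating):
    """The notebook's formula: magnitude of the sentiment times the star rating."""
    return abs(sentiment) * rating


def blended_score(sentiment, rating):
    """Hypothetical alternative: average of sentiment and a rating rescaled to [-1, 1]."""
    rating_scaled = (rating - 3) / 2  # 1 star -> -1, 3 stars -> 0, 5 stars -> +1
    return (sentiment + rating_scaled) / 2


# (sentiment_score, Rating) pairs taken from rows 0, 1 and 4 of df.head()
rows = [(0.085186, 3), (-0.024795, 3), (-0.024957, 1)]
for sentiment, rating in rows:
    print(overall_score(sentiment, rating), blended_score(sentiment, rating))
```

The first column reproduces the Overall_Score values up to rounding of the displayed sentiment scores, while the second keeps negative reviews below zero.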

In [502]:
df['Overall_Score'].head()

0    0.255559
1    0.074384
2    0.412792
3    0.259059
4    0.024957
Name: Overall_Score, dtype: float64

In [503]:
df.head()

Unnamed: 0,ReviewTitle,CompleteReview,Rating,Position,Date,Company,CompleteReview_ref,ReviewTitle_ref,sentiment_score,Overall_Score
0,[productive],"[good, company, cool, workplace, work, load, l...",3,Current Employee,"August 30, 2021",Reliance-Industries-Ltd,"Good company, cool workplace, work load little...",Productive,0.085186,0.255559
1,[stressful],"[1, need, work, bos, whim, fancy, 2, priority,...",3,Former Employee,"August 26, 2021",Reliance-Industries-Ltd,1. Need to work on boss's whims and fancies 2....,Stressful,-0.024795,0.074384
2,"[good, company, every, employee]","[good, company, every, engineer, dream, full, ...",5,Former Employee,"August 17, 2021",Reliance-Industries-Ltd,"Good company for every Engineers dream, Full M...",Good Company for Every employee,0.082558,0.412792
3,[productive],"[pas, bsc, chemistry, typical, day, work, mang...",5,Current Employee,"August 17, 2021",Reliance-Industries-Ltd,I am just pass out bsc in chemistry Typical da...,Productive,0.051812,0.259059
4,"[non, productive]","[fun, work, blame, game, target, people, le, t...",1,Former Employee,"August 9, 2021",Reliance-Industries-Ltd,Not so fun at work just blame games Target pe...,Non productive,-0.024957,0.024957


## Scoring Company Cultures

1. We will give each company an average sentiment score.
2. We will give each company an average overall score.
3. We will show the top 5 positive reviews based on both scores.
4. We will show the top 5 negative reviews based on both scores.
5. We will show the most used words to describe the company.

In [504]:
companies = df['Company'].unique()

In [505]:
companies

array(['Reliance-Industries-Ltd', 'Mphasis', 'Kpmg', 'Yes-Bank',
       'Sutherland', 'Marriott-International,-Inc.', 'DHL', 'Jio',
       'Vodafoneziggo', 'HP', 'Maersk', 'Ride.swiggy', 'Jll', 'Alstom',
       'UnitedHealth-Group', 'Tata-Consultancy-Services-(tcs)',
       'Capgemini', 'Teleperformance', 'Cognizant-Technology-Solutions',
       'Mahindra-&-Mahindra-Ltd', 'L&T-Technology-Services-Ltd.',
       'Bharti-Airtel-Limited', 'Indeed', 'Hyatt',
       'Icici-Prudential-Life-Insurance', 'Accenture', 'Honeywell',
       'Standard-Chartered-Bank', 'Nokia', 'Apollo-Hospitals',
       'Tata-Aia-Life', 'Hdfc-Bank', 'Bosch', 'Deloitte', 'Ey',
       'Microsoft', 'Barclays', 'JPMorgan-Chase', 'Muthoot-Finance',
       'Wns-Global-Services', 'Kotak-Mahindra-Bank', 'Infosys', 'Oracle',
       "Byju's", 'Deutsche-Bank', 'Hinduja-Global-Solutions', 'Ericsson',
       'Axis-Bank', 'IBM', 'Concentrix', 'Wells-Fargo', 'Google',
       'Dell-Technologies', 'Facebook', 'Amazon.com', 'Flipkart.

In [506]:
df['Date'] = df['Date'].str.lstrip()

In [507]:
def average_sentiment_score(company):
    return df['sentiment_score'][df['Company'] == company].mean()

In [508]:
def average_overall_score(company):
    return df['Overall_Score'][df['Company'] == company].mean()
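
The two helpers above filter the whole frame once per company. A minimal sketch of computing the same means for all companies in one pass with groupby, shown on a made-up toy frame (the company names and numbers are invented for illustration):

```python
import pandas as pd

# toy frame with the same column names as the notebook's df
toy = pd.DataFrame({
    "Company": ["A", "A", "B"],
    "sentiment_score": [0.10, 0.30, -0.20],
    "Overall_Score": [0.50, 0.90, 0.20],
})

# one row per company, one column per averaged score
company_scores = toy.groupby("Company")[["sentiment_score", "Overall_Score"]].mean()
print(company_scores)
```

A single company's values can then be read with company_scores.loc[name], which avoids rescanning the full frame for every lookup.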

In [532]:
def top_5_reviews_good(company):
    company_df = df[df["Company"] == company]
    top_5 = company_df.nlargest(5, "sentiment_score")
    
    print("Top 5 on basis of Sentiment Score\n\n")
    
    for index, row in top_5.iterrows():
        print(f"Title: {row['ReviewTitle_ref']}")
        print(f"Description: {row['CompleteReview_ref']}")
        print()
        
    print('\n\n')    
    print("Top 5 on basis of Overall Score\n\n")
    
    top_5 = company_df.nlargest(5, "Overall_Score")
    for index, row in top_5.iterrows():
        print(f"Title: {row['ReviewTitle_ref']}")
        print(f"Description: {row['CompleteReview_ref']}")
        print()
    print('\n\n') 

In [533]:
def top_5_reviews_bad(company):
    company_df = df[df["Company"] == company]
    top_5 = company_df.nsmallest(5, "sentiment_score")
    
    print("Bottom 5 on basis of Sentiment Score\n\n")
    
    for index, row in top_5.iterrows():
        print(f"Title: {row['ReviewTitle_ref']}")
        print(f"Description: {row['CompleteReview_ref']}")
        print()
        
    print('\n\n')    
    print("Bottom 5 on basis of Overall Score\n\n")
    
    top_5 = company_df.nsmallest(5, "Overall_Score")
    for index, row in top_5.iterrows():
        print(f"Title: {row['ReviewTitle_ref']}")
        print(f"Description: {row['CompleteReview_ref']}")
        print()
    print('\n\n') 

In [534]:
def generate_word_cloud(df, company_name):
    
    dict_words = {}
    for i in df['CompleteReview'][df['Company'] == company_name]:
        for j in i:
            if j not in dict_words:
                dict_words[j] = 1
            else:
                dict_words[j] += 1
    
    sorted_dict = {k: v for k, v in sorted(dict_words.items(), key=lambda item: item[1], reverse=True)}

    count = 0
    for key, value in sorted_dict.items():
        if count < 5:
            print(key, value)
            count += 1
        else:
            break
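
Despite its name, generate_word_cloud prints the five most frequent tokens rather than drawing a word cloud. The same counting can be sketched with collections.Counter, which is already imported at the top of the notebook. The function name top_words and the sample token lists are made up for illustration:

```python
from collections import Counter


def top_words(token_lists, n=5):
    """Return the n most frequent tokens across an iterable of token lists."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts.most_common(n)


# made-up token lists in the style of the CompleteReview column
sample = [["good", "company", "work"], ["good", "work", "culture"], ["good"]]
print(top_words(sample, n=2))
```

Returning the pairs instead of printing them lets the caller decide how to display them.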

In [535]:
def company_info(company_name):
    print("Sentiment Score : ", average_sentiment_score(company_name))
    print("Overall Score : ", average_overall_score(company_name))
    # the helpers below print their own output and return None,
    # so call them directly instead of wrapping them in print()
    top_5_reviews_good(company_name)
    top_5_reviews_bad(company_name)
    print("Top 5 words used : ")
    generate_word_cloud(df, company_name)

In [536]:
company_info(companies[3])

Sentiment Score :  0.07722220517227033
Overall Score :  0.3242340981486553
Top 5 on basis of Sentiment Score


Title: It's good
Description: Yes Bank is the Good place for work. I enjoyed my work in yes Bank and good management . Yes bank branch people's they supporting me.thanks to yes bank giving the opportunity ....​

Title: Excellent Experience
 Staff is excellent,supportive and understanable

Title: Excellent
Description: Better place to work and environment is awesome for work. Here you will get colleagues support that helps to grow here easily. And also you will get super bosses help.

Title: productive and fun
 Rewards and Recognition is one of the best part.

Title: Great Experience
 I have been blessed to start my career with Yes Bank...




Top 5 on basis of Overall Score


Title: Excellent Experience
 Staff is excellent,supportive and understanable

Title: Excellent
Description: Better place to work and environment is awesome for work. Here you will get colleagues support t

In [537]:
company_info(companies[6])

Sentiment Score :  0.07405923263550536
Overall Score :  0.3227425510385577
Top 5 on basis of Sentiment Score


Title: Great place to work
 Great rewards and recognition

Title: Great place to work
 Respect and reward.

Title: Challenging work atmosphear
Description: A good working environment with good team support. Greatest achievement is getting award from management for "Good Customer Satisfaction".

Title: Fun place and committed people
 Nice supportive co workers.

Title: Good
Description: nice place to work. Better work culture.Salary benefits are good.Management is genuine and friendly. job security is as like as any other Corporate companies.




Top 5 on basis of Overall Score


Title: Great place to work
 Respect and reward.

Title: Challenging work atmosphear
Description: A good working environment with good team support. Greatest achievement is getting award from management for "Good Customer Satisfaction".

Title: Great place to work
 Great rewards and recognition

Title: 

In [539]:
company_info(companies[0])

Sentiment Score :  0.07464897416564777
Overall Score :  0.3239655245592988
Top 5 on basis of Sentiment Score


Title: Best facility.
 Very Innovative & Creative.

Title: Great Corporate
Description: Great company with high ethical values, good work environment,great exposure.

Title: it is good..i like to join in this company
Description: yes ...its a best opportunity to join in this company...but we have to work hard...because there is no shortcut to success...........😊😊😊😊😊😊😊😊.

Title: Nice👏
 In reliance mean work is very good and all senior are very helpful they provide good support, encourage, motivation and very helpful

Title: excellent
Description: One of the best industries I have worked with. It provides all kinds of opportunities for an individual.




Top 5 on basis of Overall Score


Title: it is good..i like to join in this company
Description: yes ...its a best opportunity to join in this company...but we have to work hard...because there is no shortcut to success.........

In [540]:
company_info(companies[43])

Sentiment Score :  0.0490399587693079
Overall Score :  0.22227567326612335
Top 5 on basis of Sentiment Score


Title: Productive
Description: Good place and good growth opportunities Team spirit Enthusiastic environment Co operative people Collaborative  Good incentive structure Fun loving

Title: Great place to work in
 2. Amazing opportunities

Title: Excellent
Description: Good culture great training managers very good short term growth product is very good,employees are very friendly,all managers helped lot during and after training, very polite and professional,very good benefits

Title: Good
Description: Good atmosphere,  good package,  nice experience,  everybody should work there to know the excellent work atmosphere.  Relaxed and cool place. Growth in career

Title: Friendly
 Great incentives and international tours for performers




Top 5 on basis of Overall Score


Title: Great place to work in
 2. Amazing opportunities

Title: Excellent
Description: Good culture great trai

Thanks for reading the kernel! 