# Capgemini Reviews Dataset

### Introduction

In this lesson, we'll explore a dataset of reviews of Capgemini.

### Loading the data

Let's get started by loading our Capgemini data.

In [1]:
import pandas as pd

df = pd.read_csv('./reviews.csv')

In [3]:
reviews = df.to_dict('records')
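To see what `to_dict('records')` gives us, here is a minimal sketch on a tiny in-memory CSV. The two rows and the `sample_` names are made up for illustration and are not taken from `reviews.csv`. It also shows that, with pandas defaults, an empty cell comes back as a float nan rather than a string, which matters once we start parsing dates.

```python
import io
import math
import pandas as pd

# Hypothetical two-row CSV standing in for reviews.csv; the second row has no date.
sample_csv = io.StringIO(
    "Title,Date,Overall_rating\n"
    "Senior Consultant,8 Sep 2023,4.0\n"
    "Analyst,,3.0\n"
)
sample_df = pd.read_csv(sample_csv)

# The 'records' orientation yields one dictionary per row, keyed by column name.
sample_reviews = sample_df.to_dict('records')
print(sample_reviews[0])

# The empty Date cell is read as a missing value (a float nan), not as a string.
missing_date = sample_reviews[1]['Date']
print(type(missing_date), math.isnan(missing_date))
```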

### Exploring the data

Let's begin by exploring the data.  Remember that our first step can be to explore the grain of the data, and then we can move on to getting an overview of the data.

Begin by selecting a single review and looking at its keys.

In [7]:
first_review = reviews[0]

first_review.keys()

# dict_keys(['Title', 'Place', 'Job_type', 'Department', 'Date', 'Overall_rating', 'work_life_balance', 'skill_development', 'salary_and_benefits', 'job_security',
# 'career_growth', 'work_satisfaction', 'Likes', 'Dislikes'])

dict_keys(['Title', 'Place', 'Job_type', 'Department', 'Date', 'Overall_rating', 'work_life_balance', 'skill_development', 'salary_and_benefits', 'job_security', 'career_growth', 'work_satisfaction', 'Likes', 'Dislikes'])

And from there, look at the corresponding values.

In [8]:
first_review.values()

dict_values(['Senior Consultant', 'Pune', 'Full Time', 'General Insurance Department', '8 Sep 2023', 4.0, 4.0, 3.0, 3.0, 4.0, 4.0, 4.0, 'Deserved candidates are promoted promptly.\nUnbiased in providing opportunities to employees, regardless of their gender or any other thing', 'With designation promotions good salary increment is also required'])

### Coercing the data

* Now before we get started, let's change the data a bit.  

Create a new list of dictionaries. It should look just like the old list of reviews, but each review should also have keys of `day`, `month` and `year`, with the values coming from that review's date. The date key-value pair can be excluded. Day and year should both be integers.

If the date is nan, the review should still have keys for day, month and year, but the values can be None.

In [90]:
def build_dated_reviews(reviews):
    dated_reviews = []
    for review in reviews:
        if isinstance(review['Date'], str):
            day, month, year = review['Date'].split()

            dated = {'day': int(day), 'month': month, 'year': int(year)}
        else:
            dated = {'day': None, 'month': None, 'year': None}
        dated_review = {**review, **dated}
        dated_reviews.append(dated_review)
    return dated_reviews

* Now wrap the code in a function.  The function can be called `build_dated_reviews`, and it takes an argument of `reviews` and returns a new list of dictionaries.
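The function above keeps the original `Date` pair alongside the new keys. If you would rather exclude it, as the exercise allows, one possible variant is sketched below. The name `build_dated_reviews_without_date` and the two sample reviews are hypothetical, chosen only for this illustration.

```python
def build_dated_reviews_without_date(reviews):
    """Variant of build_dated_reviews that omits the original 'Date' key."""
    dated_reviews = []
    for review in reviews:
        date = review['Date']
        if isinstance(date, str):
            day, month, year = date.split()
            dated = {'day': int(day), 'month': month, 'year': int(year)}
        else:
            # Missing dates arrive as float nan, so fall back to None values.
            dated = {'day': None, 'month': None, 'year': None}
        # Copy every pair except 'Date', then add the three new keys.
        rest = {k: v for k, v in review.items() if k != 'Date'}
        dated_reviews.append({**rest, **dated})
    return dated_reviews

# Hypothetical sample: one review with a date and one without.
sample = [
    {'Title': 'Senior Consultant', 'Date': '8 Sep 2023'},
    {'Title': 'Analyst', 'Date': float('nan')},
]
result = build_dated_reviews_without_date(sample)
print(result)
```

Because each review is copied into a new dictionary, the original list is left unchanged.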

In [91]:
len(reviews)

26993

In [92]:
dated_reviews = build_dated_reviews(reviews)
len(dated_reviews)

26993

Now let's look at a single review -- to see how these new keys look.

In [93]:
first_dated_review = dated_reviews[0]
first_dated_review['day'], first_dated_review['month'], first_dated_review['year']

(8, 'Sep', 2023)

* And now let's write a function that, given a review, returns only the specified attributes of that review.

In [94]:
def build_selected_review(review, attrs):
    return {k:v for k, v in review.items() if k in attrs}

In [95]:
selected_attrs = ['Title', 'Job_type', 'Department', 'Overall_rating',
                  'day', 'month', 'year']

first_selected = build_selected_review(first_dated_review, selected_attrs)

print(first_selected)

# {'Title': 'Senior Consultant', 'Job_type': 'Full Time',
#  'Department': 'General Insurance Department',
#  'Overall_rating': 4.0, 'day': 8, 'month': 'Sep', 'year': 2023}

{'Title': 'Senior Consultant', 'Job_type': 'Full Time', 'Department': 'General Insurance Department', 'Overall_rating': 4.0, 'day': 8, 'month': 'Sep', 'year': 2023}


And now write a function called `build_selected_reviews` that only returns the specified attributes for each of the reviews.

In [96]:
def build_selected_reviews(reviews, attrs):
    return [build_selected_review(review, attrs) for review in reviews]

Ok, so with these attributes each review keeps just the job title, job type, department, overall rating, and the day, month and year.

In [97]:
selected_attrs = ['Title', 'Job_type', 'Department', 'Overall_rating',
                  'day', 'month', 'year']

selected_reviews = build_selected_reviews(dated_reviews, selected_attrs)


In [98]:
selected_reviews[0]
# {'Title': 'Senior Consultant',
#  'Job_type': 'Full Time',
#  'Department': 'General Insurance Department',
#  'Overall_rating': 4.0,
#  'day': 8,
#  'month': 'Sep',
#  'year': 2023}


{'Title': 'Senior Consultant',
 'Job_type': 'Full Time',
 'Department': 'General Insurance Department',
 'Overall_rating': 4.0,
 'day': 8,
 'month': 'Sep',
 'year': 2023}

### Exploring the data

Now we can explore some of the data.  The first step is to get an overview of the data.

Begin by finding the number of unique titles across the selected_reviews. 

In [48]:
len({review['Title'] for review in selected_reviews})
# 4241

4241

Ok, that's a lot.

Now write a function that, given a year, returns only the `selected_reviews` written in that year.

In [80]:
selected_reviews[:1]

[{'Title': 'Senior Consultant',
  'Job_type': 'Full Time',
  'Department': 'General Insurance Department',
  'Overall_rating': 4.0,
  'day': 8,
  'month': 'Sep',
  'year': 2023}]

In [99]:
def reviews_for_year(reviews, year):
    return [review for review in reviews if review['year'] == year]

In [102]:
selected_reviews_2023 = reviews_for_year(selected_reviews, 2023)
len(selected_reviews_2023)

# 6203

6203

Now *use the function above* to create a dictionary that has keys of the years 2017 through 2023, and values of the number of reviews for each year.

In [108]:
amounts = [len(reviews_for_year(selected_reviews, year)) for year in range(2017, 2024)]

In [109]:
years = range(2017, 2024)

In [111]:
amounts_by_year = dict(zip(years, amounts))
print(amounts_by_year)

# {2017: 491, 2018: 4065, 
# 2019: 3103, 2020: 1163, 2021: 2804, 2022: 8086, 2023: 6203}

{2017: 491, 2018: 4065, 2019: 3103, 2020: 1163, 2021: 2804, 2022: 8086, 2023: 6203}
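The same dictionary can also be built in one step with a dict comprehension. As an aside, `collections.Counter` from the standard library gets there in a single pass over the reviews instead of one scan per year. The sketch below uses a small made-up list (`sample_reviews`) rather than the real data, so the counts are illustrative only.

```python
from collections import Counter

# Hypothetical stand-in for selected_reviews; only the 'year' key matters here.
sample_reviews = [{'year': 2022}, {'year': 2023}, {'year': 2022}, {'year': None}]

def reviews_for_year(reviews, year):
    return [review for review in reviews if review['year'] == year]

# Dict comprehension: one call to reviews_for_year per year.
counts_by_filter = {year: len(reviews_for_year(sample_reviews, year))
                    for year in range(2021, 2024)}

# Counter: a single pass over the reviews; years with no reviews count as 0.
year_counts = Counter(review['year'] for review in sample_reviews)
counts_by_counter = {year: year_counts[year] for year in range(2021, 2024)}

print(counts_by_filter)
print(counts_by_filter == counts_by_counter)
```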


Next create a dictionary of the average `Overall_rating` per year.

In [120]:
import math
math.isnan(float('nan'))

True

In [127]:

hist = {}
for year in range(2017, 2024):
    year_reviews = reviews_for_year(selected_reviews, year)
    total_ratings = [review['Overall_rating'] for review in year_reviews 
                     if not math.isnan(review['Overall_rating'])]
    hist[year] = round(sum(total_ratings)/ len(total_ratings), 2)

In [128]:
hist

# {2017: 3.35,
#  2018: 3.16,
#  2019: 3.48,
#  2020: 3.75,
#  2021: 3.84,
#  2022: 3.98,
#  2023: 3.78}

{2017: 3.35,
 2018: 3.16,
 2019: 3.48,
 2020: 3.75,
 2021: 3.84,
 2022: 3.98,
 2023: 3.78}

> Hint: If you are getting nan values, it is likely because some of the reviews have a nan rating.  You can exclude these records by adding something like the following to your code.

```python
import math
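# year_reviews is the list of reviews for a single year
ratings = [review['Overall_rating'] for review in year_reviews
           if not math.isnan(review['Overall_rating'])]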
```
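To see why a single nan spoils the result, here is a minimal sketch with made-up ratings. The helper name `average_rating` is hypothetical. It also guards against the case where no valid ratings remain, which would otherwise raise a `ZeroDivisionError`.

```python
import math

# Hypothetical ratings, one of them missing.
ratings = [4.0, float('nan'), 3.0]

# Any arithmetic involving nan yields nan, so the plain average is nan too.
naive_average = sum(ratings) / len(ratings)
print(math.isnan(naive_average))

def average_rating(ratings):
    """Average the ratings that are not nan; return None if none remain."""
    valid = [r for r in ratings if not math.isnan(r)]
    if not valid:
        return None
    return round(sum(valid) / len(valid), 2)

print(average_rating(ratings))
```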