# Grammys Project
![](https://www.moviedebuts.com/wp-content/uploads/2021/05/ra_ga_logo.png)

Are you ready to dive into data work for an exciting project at The Recording Academy, the non-profit organization behind the Grammy Awards?

In this project, you'll work with real data from both websites owned by The Recording Academy. As you just learned, Ray Starck, the VP of Digital Strategy, decided to split the website into grammy.com and recordingacademy.com to better serve the needs of the Recording Academy's different audiences.

Now you are tasked with examining the impact of that split and analyzing the data to better understand trends and audience behavior on both sites.

Are you ready?!?!

Let's do this!

![](https://media.giphy.com/media/ZSK6UPKTSLZCKd7orz/giphy.gif)

## Data Dictionary
To start, you will be working with two files, `grammy_live_web_analytics.csv` and `ra_live_web_analytics.csv`.

These files will contain the following information:

- **date** - The date the data was confirmed. It is in `yyyy-mm-dd` format.
- **visitors** - The number of users who went on the website on that day.
- **pageviews** - The number of pages that all users viewed on the website.
- **sessions** - The total number of sessions on the website. A session is a group of user interactions with your website that take place within a given time frame. For example, a single session can contain multiple page views, events, and social interactions.
- **bounced_sessions** - The total number of bounced sessions on the website. A bounced session is when a visitor comes to the website, does not interact with any pages or links, and leaves.
- **avg_session_duration_secs** - The average session duration, in seconds, across all sessions on the website that day.
- **awards_week** - A binary flag indicating whether the date falls within the marketing campaigns run before and after the Grammy Awards ceremony. This is the big marketing push to get as many eyeballs watching the event as possible.
- **awards_night** - A binary flag indicating the actual night the Grammy Awards ceremony was held.

# Part I - Exploratory Data Analysis

## Task 1

Import the `pandas`, `numpy`, and `plotly.express` libraries.

In [43]:
# Import libraries
import pandas as pd
import numpy as np
import plotly.express as px 

In [44]:
# RUN THIS CELL - DO NOT MODIFY
# this formats numbers to two decimal places when shown in pandas
pd.set_option('display.float_format', lambda x: '%.2f' % x)

## Task 2

Load in the first two files for your analysis. They are the `grammy_live_web_analytics.csv` and `ra_live_web_analytics.csv`.


**A.** For the `grammy_live_web_analytics.csv` file store that into a dataframe called `full_df`

**B.** For the `ra_live_web_analytics.csv` file store that into a dataframe called `rec_academy`

**C.** Preview the dataframes to familiarize yourself with the data.



In [45]:
# Read in dataframes
full_df = pd.read_csv('grammy_live_web_analytics.csv')
rec_academy = pd.read_csv('ra_live_web_analytics.csv')
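As an optional variation on the loading step, here is a minimal sketch of reading the `date` column as real datetimes rather than strings. The rows are copied from the `full_df` preview and inlined so the sketch runs without the CSV file. The name `sample_df` is hypothetical, and the exact column layout of the real file is an assumption.

```python
import io
import pandas as pd

# Three rows copied from the full_df preview, inlined so no file is needed.
sample_csv = io.StringIO(
    "date,visitors,pageviews,sessions,bounced_sessions,"
    "avg_session_duration_secs,awards_week,awards_night\n"
    "2017-01-01,9611,21407,10196,6490,86,0,0\n"
    "2017-01-02,10752,25658,11350,7055,100,0,0\n"
    "2017-01-03,11425,27062,12215,7569,92,0,0\n"
)

# parse_dates converts `date` to datetime64 instead of leaving it as text.
sample_df = pd.read_csv(sample_csv, parse_dates=["date"])

print(sample_df.dtypes["date"])
print(sample_df.shape)
```

The rest of the notebook keeps `date` as a string, which also works here because the `yyyy-mm-dd` format sorts chronologically.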

In [46]:
# preview full_df dataframe
full_df.head(10)

Unnamed: 0,date,visitors,pageviews,sessions,bounced_sessions,avg_session_duration_secs,awards_week,awards_night
0,2017-01-01,9611,21407,10196,6490,86,0,0
1,2017-01-02,10752,25658,11350,7055,100,0,0
2,2017-01-03,11425,27062,12215,7569,92,0,0
3,2017-01-04,13098,29189,13852,8929,90,0,0
4,2017-01-05,12234,28288,12990,8105,95,0,0
5,2017-01-06,11461,28022,12309,7500,95,0,0
6,2017-01-07,11183,27491,11993,7287,94,0,0
7,2017-01-08,16265,38529,17361,10587,84,0,0
8,2017-01-09,17852,40698,18914,11697,87,0,0
9,2017-01-10,14211,33218,15124,9270,96,0,0


In [47]:
# preview rec_academy dataframe
rec_academy.head(10)

Unnamed: 0,date,visitors,pageviews,sessions,bounced_sessions,avg_session_duration_secs,awards_week,awards_night
0,2022-02-01,928,2856,1092,591,148,0,0
1,2022-02-02,1329,3233,1490,923,90,0,0
2,2022-02-03,1138,3340,1322,754,127,0,0
3,2022-02-04,811,2552,963,534,142,0,0
4,2022-02-05,541,1530,602,326,111,0,0
5,2022-02-06,536,1669,610,339,147,0,0
6,2022-02-07,921,3512,1117,567,198,0,0
7,2022-02-08,1106,3662,1296,661,163,0,0
8,2022-02-09,1181,4209,1382,725,150,0,0
9,2022-02-10,1134,3473,1341,727,140,0,0


## Task 3

We all know the Grammy Awards is *the* biggest event in the music industry, but how many visitors does that bring to the website?

**A.** Create a line chart of the number of visitors on the site for every day in `full_df`. See if you can spot the days the Grammy Awards are held.

In [48]:
# Plot a line chart of the visitors on the site.
px.line(full_df, x='date', y='visitors')

<span style='background :#FFF59E'>**Remark:** The smaller spikes, typically around November/December of each year, are when the nominees are announced.</span>

**B.** What can you say about the visitors to the website by looking at the graph?

**This graph shows that the Grammys website typically has very low engagement except when the Grammys are near or trending, for example around nominee announcements.
Visitor behavior is heavily impacted by the Grammys' current relevance.**
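To pin down the spike dates rather than reading them off the chart, one option is to filter on the `awards_night` flag or to pull the days with the most visitors. The sketch below uses a hypothetical toy dataframe (`toy`, with made-up visitor counts); only the column names come from the data dictionary.

```python
import pandas as pd

# Hypothetical toy data: values are invented for illustration.
toy = pd.DataFrame({
    "date": ["2020-01-25", "2020-01-26", "2020-01-27", "2020-01-28"],
    "visitors": [40_000, 1_500_000, 300_000, 60_000],
    "awards_night": [0, 1, 0, 0],
})

# Option 1: rely on the flag that ships with the data.
flagged_dates = toy.loc[toy["awards_night"] == 1, "date"].tolist()

# Option 2: look for the largest spikes directly.
top_days = toy.nlargest(2, "visitors")["date"].tolist()

print(flagged_dates)
print(top_days)
```

Applied to `full_df`, the same two expressions would list the ceremony dates and the busiest days.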

## Task 4

Let's investigate what an "average" day looks like when the awards show is being hosted versus the other 364 days out of the year.

**A.** Calculate the average number of visitors on the site, grouped by the values in the column `awards_night`.

In [49]:
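# numeric_only=True skips the string `date` column, which recent pandas versions refuse to average
full_df.groupby('awards_night').mean(numeric_only=True)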

awards_night,visitors,pageviews,sessions,bounced_sessions,avg_session_duration_secs,awards_week
0,32388.28,64079.99,35227.83,15231.58,98.43,0.04
1,1389590.23,4227169.92,1737653.08,591708.77,154.23,1.0


**B.** What can you say about these results? Roughly how many more visitors are on the website for the awards ceremony versus a regular day?

**There are approximately 43 times as many visitors on the website on the night of the awards ceremony as on a regular day (about 1.39 million versus about 32,000).
This further supports the point that users' desire to visit the website is influenced by how relevant the Grammys are in recent news.**
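The ratio quoted in the answer can be checked with a short calculation. This is a sketch using the two `visitors` means copied from the grouped output; the variable names are illustrative.

```python
# Means copied from the grouped output (visitors column).
mean_regular_day = 32388.28
mean_awards_night = 1389590.23

ratio = mean_awards_night / mean_regular_day
extra_visitors = mean_awards_night - mean_regular_day

print(f"{ratio:.1f} times as many visitors on awards night")
print(f"{extra_visitors:,.0f} additional visitors compared with a regular day")
```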

<span style='background :#FFF59E'>**Remark:** This is The Recording Academy's biggest challenge! How do you transform a business that relies on the success of **one** event per year into one that continues to bring users back on the site year round?</span>

## Task 5

When The Recording Academy decided to split their website into two domains, grammy.com and recordingacademy.com, the data capture for grammy.com was not affected, so `full_df` contains both the combined-site data and the post-split grammy.com data. It therefore needs to be split into two dataframes. The domains were switched on `2022-02-01`.

Create two new dataframes:

1. `combined_site` for all dates before `2022-02-01`
2. `grammys` for all dates after (and including) `2022-02-01`

In [50]:
# Split the data to separate the full_df into two new dataframes.
# One for before the switch of the websites and one for after

combined_site = full_df[full_df['date'] < '2022-02-01']
grammys = full_df[full_df['date'] >= '2022-02-01']


In [51]:
# Run the following cell - DO NOT MODIFY
# .copy() prevents pandas from printing a scary-looking warning message
combined_site = combined_site.copy()
grammys = grammys.copy()

In [52]:
# print the shape of the combined_site dataframe
print(combined_site.shape)

(1857, 8)


<span style='background :#DDD5F3'>If done correctly, the `combined_site` dataframe should have a total of `1857` rows and `8` columns</span>
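The split relies on `yyyy-mm-dd` strings sorting in chronological order, which is why comparing `date` against the string `'2022-02-01'` works. Below is a minimal sketch on a hypothetical toy dataframe (`toy`) showing that a datetime-based comparison gives the same result and that the two pieces cover every row exactly once.

```python
import pandas as pd

# Hypothetical toy frame; only the cutoff date comes from the task.
toy = pd.DataFrame({
    "date": ["2022-01-30", "2022-01-31", "2022-02-01", "2022-02-02"],
    "visitors": [100, 110, 120, 130],
})
cutoff = "2022-02-01"

# ISO-formatted date strings compare in chronological order.
before_str = toy[toy["date"] < cutoff]

# Converting to datetime gives the same split without depending on the string format.
as_dates = pd.to_datetime(toy["date"])
before_dt = toy[as_dates < pd.Timestamp(cutoff)]
after_dt = toy[as_dates >= pd.Timestamp(cutoff)]

print(before_str.equals(before_dt))
print(len(before_dt) + len(after_dt) == len(toy))
```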

# Part II - It's all about KPIs

There are certain key performance indicators (KPIs) of interest for The Recording Academy. Let's investigate those a little more.

## Task 6

**A.** Create a new list called `frames` that has the `combined_site`, `rec_academy`, and `grammys` dataframes as entries. 


In [53]:
# create the list of dataframes

frames = [combined_site, grammys, rec_academy]
frames_dict = {'combined site': combined_site, 'Grammys':grammys,'Recording Academy':rec_academy}

**B.** For each frame in the `frames` list, create a new column `pages_per_session`. This new column is the average number of pageviews per session on a given day. The higher this number, the more "stickiness" your website has with your visitors.



In [54]:
# create the `pages_per_session` column for all 3 dataframes.
for frame in frames:
    frame['pages_per_session'] = frame['pageviews'] / frame['sessions']
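The loop changes the original dataframes because the list holds references to them, not copies. The sketch below illustrates this with the first two preview rows of `full_df` and `rec_academy` (pageviews and sessions only); `full_sample` and `ra_sample` are hypothetical names.

```python
import pandas as pd

# First two rows of each preview shown earlier (pageviews and sessions only).
full_sample = pd.DataFrame({"pageviews": [21407, 25658], "sessions": [10196, 11350]})
ra_sample = pd.DataFrame({"pageviews": [2856, 3233], "sessions": [1092, 1490]})

# The list stores references, so adding a column through the loop variable
# also changes full_sample and ra_sample themselves.
for frame in [full_sample, ra_sample]:
    frame["pages_per_session"] = frame["pageviews"] / frame["sessions"]

print(full_sample["pages_per_session"].round(2).tolist())
print(ra_sample["pages_per_session"].round(2).tolist())
```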

**C.** Visualize this new metric using a line chart for each site. (You will have 3 separate graphs)

In [55]:
# combined_site graph
px.line(combined_site, x='date', y='pages_per_session')

In [56]:
combined_site['pages_per_session'].median()

1.5728492501973166

In [57]:
# grammys graph
px.line(grammys, x='date', y='pages_per_session')

In [58]:
grammys['pages_per_session'].median()

2.1043788961532615

In [59]:
# rec_academy graph
px.line(rec_academy, x='date', y='pages_per_session')

In [60]:
rec_academy['pages_per_session'].median()

2.885383806519453

**D.** Looking at the 3 charts above, what can you say about the `pages_per_session` when the websites were combined versus after they were split?

<span style='background :#FFF59E'>**Note:** Any large spikes in the data that do not correspond with the Grammy Awards Ceremony can be attributed to abnormalities in the data collection process and ignored in your analysis.</span>

**While the Grammys and Recording Academy shared a combined website, they typically did not see more than 3 pages viewed per session. With separate websites, both the Grammys and the Recording Academy have seen higher typical ranges of pages viewed per session, and the median pages per session for each separated site (about 2.10 and 2.89) is higher than the median for the combined site (about 1.57). User behavior also differs between the two sites. On the Grammys website there is the expected spike of activity early in the year and fairly low activity otherwise. On the Recording Academy website, there is a slight increase in activity in August that is sustained for around 8 months.**
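The medians computed for each site are less sensitive than a mean would be to the data-collection spikes mentioned in the note. Here is a small sketch with made-up daily values (`daily_values` is hypothetical) that include one such anomaly.

```python
from statistics import mean, median

# Hypothetical daily pages_per_session values with one collection anomaly.
daily_values = [2.0, 2.1, 2.2, 2.1, 40.0]

mean_value = mean(daily_values)
median_value = median(daily_values)

print(f"mean:   {mean_value:.2f}")
print(f"median: {median_value:.2f}")
```

The single outlier pulls the mean far away from the typical day, while the median stays close to it.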

## Task 7

Bounce rate is another important metric for The Recording Academy. It measures the percentage of sessions in which a visitor comes to the site, *never interacts with the website, and leaves*. In this task, you will define a function that takes in a dataframe as input and outputs the bounce rate.

**A.** Create a function called `bounce_rate` that:

1. Takes in a `dataframe` as input
2. Adds up all of the values in the `bounced_sessions` column and stores the result in a variable called `sum_bounced`
3. Adds up all of the values in the `sessions` column and stores the result in a variable called `sum_sessions`
4. Returns `100 * sum_bounced / sum_sessions`



In [61]:
def bounce_rate(dataframe):
    '''
    Calculates the bounce rate for visitors on the website.
    input: dataframe with bounced_sessions and sessions columns
    output: numeric value from bounce rate
    '''
    
    sum_bounced = dataframe['bounced_sessions'].sum()
    sum_sessions = dataframe['sessions'].sum()
    return 100 * sum_bounced / sum_sessions




**B.** Use the `frames` variable from Task 6 to loop over each website (represented by a dataframe) to calculate the bounce rate. Print the bounce rate for each site.


In [62]:
# Calculate the bounce rate for each site.
# frames_dict (built in Task 6) pairs each dataframe with a readable site name.
for name, frame in frames_dict.items():
    rate = bounce_rate(frame)
    print(f'The bounce rate for the {name} is {rate:0.2f}')

The bounce rate for the combined site is 41.58
The bounce rate for the Grammys is 40.16
The bounce rate for the Recording Academy is 33.67


<span style='background :#DDD5F3'>If done correctly, the `combined_site` and `grammys` site will each have bounce rates in the low 40s. The `rec_academy` will have a bounce rate in the low 30s</span>
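The `bounce_rate` function divides total bounced sessions by total sessions, which is not the same as averaging each day's bounce rate. The sketch below contrasts the two on the first three preview rows of `rec_academy`; the function body repeats the Task 7 formula, and `sample` is a hypothetical name.

```python
import pandas as pd

def bounce_rate(dataframe):
    """Aggregate bounce rate, same formula as the function defined in Task 7."""
    return 100 * dataframe["bounced_sessions"].sum() / dataframe["sessions"].sum()

# First three rows of the rec_academy preview shown earlier.
sample = pd.DataFrame({
    "sessions": [1092, 1490, 1322],
    "bounced_sessions": [591, 923, 754],
})

aggregate_rate = bounce_rate(sample)
mean_of_daily_rates = (100 * sample["bounced_sessions"] / sample["sessions"]).mean()

print(f"aggregate bounce rate: {aggregate_rate:.2f}")
print(f"mean of daily rates:   {mean_of_daily_rates:.2f}")
```

The aggregate version weights busy days more heavily, so it describes the typical session rather than the typical day.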

**C.** Another useful metric is how long on average visitors are staying on the website.

Calculate the `mean` of the `avg_session_duration_secs` for each of the sites.
Print each one using an f-string.

In [63]:
# Calculate the average of avg_session_duration_secs for each site.
# frames_dict (built in Task 6) pairs each dataframe with a readable site name.
for name, frame in frames_dict.items():
    avg_duration = frame['avg_session_duration_secs'].mean()
    print(f'The mean average session duration in seconds for the {name} is {avg_duration:0.2f}')


The mean average session duration in seconds for the combined site is 102.85
The mean average session duration in seconds for the Grammys is 82.99
The mean average session duration in seconds for the Recording Academy is 128.50


**D.** What can you say about these two metrics as it relates to each of the websites?

**The Grammys and Recording Academy websites both have lower bounce rates than the combined site did (40.16 and 33.67 versus 41.58), meaning a larger share of sessions involve some interaction, although the difference between the combined site and the Grammys site is marginal. The average session duration is longer than the combined site's for the Recording Academy (128.50 versus 102.85 seconds) but shorter for the Grammys website (82.99 seconds). Together, these two metrics suggest that the Recording Academy site is more successful than the Grammys site at prolonging user visits.**
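One caveat on the duration figures: taking the mean of `avg_session_duration_secs` gives every day equal weight, regardless of how many sessions it had. The sketch below shows how a session-weighted mean can differ, using the first two preview rows of `rec_academy`; the variable names are illustrative, and the notebook's reported figures use the unweighted mean.

```python
# First two rows of the rec_academy preview shown earlier.
durations_secs = [148, 90]
sessions = [1092, 1490]

# Every day counts equally.
unweighted = sum(durations_secs) / len(durations_secs)

# Days with more sessions count proportionally more.
weighted = sum(d * s for d, s in zip(durations_secs, sessions)) / sum(sessions)

print(f"unweighted mean of daily averages: {unweighted:.2f}")
print(f"session-weighted mean:             {weighted:.2f}")
```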

# Part III - Demographics


Age demographics are a way to see which audience(s) your content is resonating with the most. This can inform marketing campaigns, ads, and much more.

Let's investigate the demographics for the two websites. 

## Task 8

The `grammys_age_demographics.csv` and `tra_age_demographics.csv` each contain the following information:

- **age_group** - The age group range. e.g. `18-24` are all visitors between the ages of 18 to 24 who come to the site.
- **pct_visitors** - The percentage of all of the website's visitors that come from that specific age group.

**A.** Read in the `grammys_age_demographics.csv` and `tra_age_demographics.csv` files and store them into dataframes named `age_grammys` and `age_tra`, respectively.

In [64]:
# read in the files
age_grammys = pd.read_csv('grammys_age_demographics.csv')
age_tra = pd.read_csv('tra_age_demographics.csv')

In [65]:
# preview the age_grammys file. the age_tra will look very similar.
age_grammys.sample()

Unnamed: 0,age_group,pct_visitors
0,18-24,27.37


**B.** For each dataframe, create a new column called `website` whose value is the name of the website.
e.g. the `age_grammys` values for `website` should all be `Grammys` and for the `age_tra` they should be `Recording Academy`.

In [66]:
# create the website column
age_grammys['website'] = 'Grammys'
age_tra['website'] = 'Recording Academy'
age_grammys.head()

Unnamed: 0,age_group,pct_visitors,website
0,18-24,27.37,Grammys
1,25-34,24.13,Grammys
2,35-44,18.72,Grammys
3,45-54,13.57,Grammys
4,55-64,9.82,Grammys


**C.** Join these two datasets together. Store the result into a new variable called `age_df`



In [67]:
# join the two datasets

age_df = pd.concat([age_grammys, age_tra])
age_df.shape

(12, 3)

<span style='background :#DDD5F3'>If done correctly your new dataframe will have `12` rows and `3` columns.</span>
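By default `pd.concat` keeps each dataframe's original row labels, so the stacked result has repeated index values. The sketch below uses hypothetical two-row stand-ins (`a` and `b`, with made-up percentages) to show the difference `ignore_index=True` makes, plus a quick check that each site's percentages sum to 100.

```python
import pandas as pd

# Hypothetical two-row stand-ins for age_grammys and age_tra.
a = pd.DataFrame({"age_group": ["18-24", "25-34"], "pct_visitors": [55.0, 45.0]})
b = pd.DataFrame({"age_group": ["18-24", "25-34"], "pct_visitors": [50.0, 50.0]})
a["website"] = "Grammys"
b["website"] = "Recording Academy"

stacked = pd.concat([a, b])                        # index repeats: 0, 1, 0, 1
renumbered = pd.concat([a, b], ignore_index=True)  # index: 0, 1, 2, 3

# Sanity check: each site's percentages should sum to roughly 100.
totals = renumbered.groupby("website")["pct_visitors"].sum()

print(list(stacked.index))
print(list(renumbered.index))
print(totals)
```

Either form works for the bar chart in this task; the repeated index only matters if rows are later selected by label.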

**D.** Create a bar chart of the `age_group` and `pct_visitors`. This chart should have, for each age group, one color for the Recording Academy and a different color for the Grammys.




In [68]:
# Create bar chart
# px.bar plots the pre-computed percentages directly; barmode='group' places the two sites side by side
fig = px.bar(age_df, x='age_group', y='pct_visitors', color='website', barmode='group')
fig.show()


**E.** Looking at the chart above, what can you say about how the age demographics differ between the two websites?

**The age demographics follow a similar overall trend on both sites: the youngest age group is the most common, and the share decreases as the age groups get older. Some slight differences exist in the websites' popularity within age groups. The Recording Academy seems to be more popular among people ages 25-54, and less popular among people ages 18-24 and 55 and older. However, these differences are all within about 2 percentage points and could be random rather than a sign of anything underlying.**
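To read the per-group differences as numbers rather than off the chart, one option is to pivot the long table so each site becomes a column. The sketch below uses a hypothetical `toy_age_df` with made-up percentages; only the column names and site labels follow the notebook.

```python
import pandas as pd

# Hypothetical percentages for illustration only.
toy_age_df = pd.DataFrame({
    "age_group": ["18-24", "25-34", "18-24", "25-34"],
    "pct_visitors": [27.0, 24.0, 26.0, 25.5],
    "website": ["Grammys", "Grammys", "Recording Academy", "Recording Academy"],
})

# One row per age group, one column per website.
wide = toy_age_df.pivot(index="age_group", columns="website", values="pct_visitors")

# Positive values mean the age group makes up a larger share on the Recording Academy site.
wide["difference"] = wide["Recording Academy"] - wide["Grammys"]

print(wide)
```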

# Part IV - Recommendation


## Task 9

Using the work you did in this project, would you recommend that the websites stay separate? Please give a 2-3 paragraph answer using details from the analysis work above explaining why or why not they should stay separate.

**I believe that the Grammys and Recording Academy websites should stay separate. There is a considerable difference in the activity on the two websites. Specifically, the Recording Academy website is more successful than the Grammys website at engaging visitors. It has a higher median number of pages viewed per session, a longer average session duration, and a lower bounce rate, which means a larger share of sessions involve interaction with the website. Furthermore, the Recording Academy website also has more consistent user behavior over the 15-month span in which user activity was recorded.**

**Even though the Grammys website sees lower engagement than the Recording Academy website, it still sees more engagement than the combined site did. The Grammys website has a higher median number of pages viewed per session and a lower bounce rate than the combined website.**

**However, this recommendation is conditional on receiving proof that these differences are not coincidental and are not caused by something that can be incorporated into a combined website. Analyzing the demographic breakdown of visitors to both websites revealed that they appeal to similar audiences overall as far as age is concerned. These findings did not provide an explanation for the difference in user activity. Given that the Grammys website has a much lower average session duration than both the Recording Academy site and the combined site, I believe that there are specific attributes of the Recording Academy website that hold visitors' attention for longer. Given the more consistent user behavior, I also believe that these attributes may give the Recording Academy website the year-round consumer appeal that the company is looking to achieve. Figuring out what these attributes are will provide more insight into whether or not the websites should stay separate.**