# Music Preferences Analysis

## Contents <a id='back'></a>

* [Introduction](#intro)
* [Stage 1. Data overview](#data_review)
    * [Conclusions](#data_review_conclusions)
* [Stage 2. Data preprocessing](#data_preprocessing)
    * [3.1 Header style](#header_style)
    * [3.2 Missing values](#missing_values)
    * [3.3 Duplicates](#duplicates)
    * [Conclusions](#data_preprocessing_conclusions)
* [Stage 3. Testing the hypotheses](#hypotheses)
    * [4.1 Hypothesis 1: user activity in the two cities](#activity)
    * [4.2 Hypothesis 2: music preferences on Monday and Friday](#week)
    * [4.3 Hypothesis 3: genre preferences in Springfield and Shelbyville](#genre)
    * [Conclusions](#hypotheses_testing_conclusions)
* [Findings](#end)

## 1. Introduction <a id='intro'></a>

In this project, we compare the music preferences and listening behavior of users in Springfield and Shelbyville, using Yandex.Music data.

We will be testing the following **hypotheses**:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

There is no information about the quality of the data, so we will explore it before testing the hypotheses. First, we'll evaluate the quality of the data and see whether its issues are significant. Then, during data preprocessing, we will try to account for the most critical problems.
 
The project consists of three **stages**:
 1. Data overview
 2. Data preprocessing
 3. Testing the hypotheses
 
 [⬆ Back to Contents](#back)

## 2. Stage 1: Data Overview <a id='data_review'></a>

In [1]:
# import necessary libraries
import pandas as pd
import re

In [2]:
# read data from csv file
music_data = pd.read_csv('music_project_en.csv')

music_data.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
5,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
8,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Shelbyville,21:20:49,Wednesday


In [3]:
# obtaining general information about the data
music_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63736 non-null  object
 2   artist    57512 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table contains 7 columns, all of them stored with the same data type: `object`.

According to the documentation:
- `'userID'` — user identifier
- `'Track'` — track title
- `'artist'` — artist's name
- `'genre'`
- `'City'` — user's city
- `'time'` — the exact time the track was played
- `'Day'` — day of the week

We can see three style issues in the column names:
1. Some names start with an uppercase letter, while others are all lowercase.
2. Some names have leading or trailing spaces (`userID` and `City` in the `info()` output above).
3. One name, `userID`, uses camelCase instead of snake_case.

Additionally, the non-null counts differ between columns, which means the data contains missing values.

### Conclusions <a id='data_review_conclusions'></a>

Each row in the table stores data on a track that was played. Some columns describe the track itself: its title, artist and genre. The rest convey information about the user: the city they come from, the time they played the track. 

It's clear that the data is sufficient to test the hypotheses. However, there are missing values.

To move forward, we need to preprocess the data.

[⬆ Back to Contents](#back)

## 3. Stage 2: Data preprocessing <a id='data_preprocessing'></a>

### 3.1 Header style <a id='header_style'></a>
First, we'll correct the formatting of the column headers, renaming them according to the rules of good style:
* If the name has several words, use snake_case
* All characters must be lowercase
* Delete spaces

In [4]:
# list of column names in the table
columns = music_data.columns
columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [5]:
# renaming columns using regular expressions
def renaming_columns(columns):
    '''
    Rename the columns of a dataframe by separating camelCase, removing leading and trailing spaces,
    changing to lowercase, and replacing spaces between words with underscores for snake case.
    
    parameters: columns - a list of the original headers of the dataframe.
    
    returns: a dictionary with the original headers as keys and the cleaned headers as values.
    '''
    
    new_headers = {}
    
    for header in columns:
        new_header = re.sub(r'([a-z])([A-Z])', r'\1 \2', header)  #split camel case into sep words
        new_header = new_header.strip().lower().replace(' ', '_') 
        new_headers[header] = new_header
        
    return new_headers

columns_clean = renaming_columns(columns)
                  

music_data = music_data.rename(columns = columns_clean)
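
As a quick sanity check, the same regex, strip, lowercase, and underscore steps can be applied to the header list printed above without loading the data. This is only a sketch: `to_snake_case` is a hypothetical helper name that mirrors the body of `renaming_columns()` for a single header.

```python
import re

# Header list copied from the `music_data.columns` output above.
raw_headers = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']

def to_snake_case(header):
    """Hypothetical helper: the same steps as renaming_columns(), for one header."""
    # insert a space at each lowercase-to-uppercase boundary: 'userID' -> 'user ID'
    spaced = re.sub(r'([a-z])([A-Z])', r'\1 \2', header)
    # trim outer spaces, lowercase, and join the remaining words with underscores
    return spaced.strip().lower().replace(' ', '_')

cleaned = [to_snake_case(h) for h in raw_headers]
print(cleaned)
```

The printed list should match the column names shown by `music_data.head()` after renaming.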

<div style="background-color: snow; padding: 10px;border-left: 7px solid pink">
    ✨
Regular expressions weren't necessary for this task; renaming the columns could have been accomplished in a simpler and faster manner. However, using regular expressions provided an opportunity for additional practice. Moreover, this approach facilitates automation, ensuring consistency across headers if more tables require renaming in the future.
</div>

In [6]:
music_data.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday


[⬆ Back to Contents](#back)

### 3.2 Missing Values <a id='missing_values'></a>

Now we'll check whether there are missing values in the data. First, we'll count the missing values in each column.

In [7]:
# calculating missing values
missing_values_count = music_data.isna().sum()
missing_values_count

user_id       0
track      1343
artist     7567
genre      1198
city          0
time          0
day           0
dtype: int64

In [8]:
# fraction of missing values in each column
missing_values_per = music_data.isna().sum() / music_data.shape[0]
missing_values_per

user_id    0.000000
track      0.020636
artist     0.116274
genre      0.018408
city       0.000000
time       0.000000
day        0.000000
dtype: float64

We have missing values for the `track`, `artist` and `genre` columns. The `artist` column is missing a high number of values, accounting for almost 12% of the rows in the table, as opposed to the other two columns, which are missing around 2%. However, not all missing values affect the research; the missing values in `track` and `artist` are not critical.

But missing values in `'genre'` can affect the comparison of music preferences in Springfield and Shelbyville. Since we do not have the opportunity to learn the reasons why the data is missing and try to make up for them, we will:
* Fill in these missing values with markers
* Evaluate how much the missing values may affect our computations

We will continue by replacing the missing values in `'track'`, `'artist'`, and `'genre'` with the string `'unknown'`.

In [9]:
# looping over column names and replacing missing values with 'unknown'
columns_to_replace = ('track', 'artist', 'genre')

for col in columns_to_replace:
    music_data[col] = music_data[col].fillna('unknown')
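
One way to gauge how much the `'unknown'` marker could affect the genre comparison is to look at its share of plays in each city. The sketch below uses a small made-up frame (`toy`) rather than `music_data`, so the numbers are purely illustrative.

```python
import pandas as pd

# Toy data, made up for illustration; not taken from music_project_en.csv.
toy = pd.DataFrame({
    'city': ['Springfield'] * 4 + ['Shelbyville'] * 2,
    'genre': ['pop', None, 'rock', 'dance', None, 'pop'],
})

# Same marker strategy as in the notebook.
toy['genre'] = toy['genre'].fillna('unknown')

# Fraction of plays per city whose genre is the 'unknown' marker.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the share differed a lot between the cities, the genre rankings would be harder to compare.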

To make sure the table contains no more missing values, let's count the missing values again.

In [10]:
# counting missing values
missing_values_count_upd = music_data.isna().sum()
missing_values_count_upd

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

[⬆ Back to Contents](#back)

### 3.3 Duplicates <a id='duplicates'></a>

Now, we'll find the number of duplicated rows in the table and delete them.

In [11]:
# counting clear duplicates
music_data.duplicated().sum()

3826

In [12]:
# removing obvious duplicates
music_data = music_data.drop_duplicates().reset_index(drop=True)

In [13]:
# checking for duplicates
music_data.duplicated().sum()

0
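
For reference, `duplicated()` flags a row only when every column matches an earlier row, and `drop_duplicates()` keeps the first occurrence by default. A minimal sketch on made-up data (the `toy` frame is an assumption, not part of the project data):

```python
import pandas as pd

# Toy data: the second row repeats the first one exactly.
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'A'],
    'track': ['t1', 't1', 't1', 't2'],
})

# Count rows that are exact copies of an earlier row.
n_dupes = toy.duplicated().sum()

# Drop the copies and rebuild a gap-free 0..n-1 index.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(deduped)
```

Rows that share a track but differ in `user_id` (or the reverse) are not treated as duplicates.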

Now that there are no more obvious duplicates, let's check for implicit duplicates in the `genre` column, that is, genre names written in different ways. These could be misspelled names or alternative names of the same genre. Such errors will also affect the result.

In [14]:
# viewing unique genre names
unique_genres = music_data['genre'].sort_values().unique()
unique_genres_sorted = sorted(unique_genres)

unique_genres_sorted

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

[⬆ Back to Contents](#back)

From the previous list, we can see the following implicit duplicates:
* `hip`
* `hop`
* `hip-hop`

To get rid of them, we'll create a function that should replace each value from the `wrong_genres` list with the value in `correct_genre`.

In [15]:
# function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    '''
    Replace incorrect genre names (implicit duplicates) with the correct name.
    
    Parameters: wrong_genres - list of incorrect labels to be replaced.
                correct_genre - string of the correct genre name to replace wrong genres.
                
    Returns: data frame with incorrect genre names replaced.
    '''
    
    music_data['genre'] = music_data['genre'].replace(wrong_genres, correct_genre)
    return music_data
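
To see what `Series.replace()` does when given a list of values and a single replacement, here is a minimal sketch on a toy Series (the values are chosen for illustration only):

```python
import pandas as pd

# Toy genre column with three spellings of the same genre.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'pop'])

# Every value found in the list is replaced with the single correct label.
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned.tolist())
```

The replacement matches whole values only, so `'pop'` is left alone even though it resembles `'hop'`.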

We'll call `replace_wrong_genres()` and pass it arguments so that it clears the implicit duplicates (`hip`, `hop`, and `hip-hop`) and replaces them with `hiphop`:

In [16]:
# removing implicit duplicates
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
    
music_data = replace_wrong_genres(wrong_genres, correct_genre)

To make sure the duplicate names were removed, let's print the list of unique values from the `'genre'` column once again.

In [17]:
# checking for implicit duplicates
unique_genres = music_data['genre'].sort_values().unique()
unique_genres_sorted = sorted(unique_genres)

unique_genres_sorted

['acid',
 'acoustic',
 'action',
 'adult',
 'africa',
 'afrikaans',
 'alternative',
 'ambient',
 'americana',
 'animated',
 'anime',
 'arabesk',
 'arabic',
 'arena',
 'argentinetango',
 'art',
 'audiobook',
 'avantgarde',
 'axé',
 'baile',
 'balkan',
 'beats',
 'bigroom',
 'black',
 'bluegrass',
 'blues',
 'bollywood',
 'bossa',
 'brazilian',
 'breakbeat',
 'breaks',
 'broadway',
 'cantautori',
 'cantopop',
 'canzone',
 'caribbean',
 'caucasian',
 'celtic',
 'chamber',
 'children',
 'chill',
 'chinese',
 'choral',
 'christian',
 'christmas',
 'classical',
 'classicmetal',
 'club',
 'colombian',
 'comedy',
 'conjazz',
 'contemporary',
 'country',
 'cuban',
 'dance',
 'dancehall',
 'dancepop',
 'dark',
 'death',
 'deep',
 'deutschrock',
 'deutschspr',
 'dirty',
 'disco',
 'dnb',
 'documentary',
 'downbeat',
 'downtempo',
 'drum',
 'dub',
 'dubstep',
 'eastern',
 'easy',
 'electronic',
 'electropop',
 'emo',
 'entehno',
 'epicmetal',
 'estrada',
 'ethnic',
 'eurofolk',
 'european',
 'expe

<div style="background-color: snow; padding: 10px;border-left: 7px solid pink">
    ✨
<code>Hiphop</code> was the only required genre to change. It's worth noting that besides hiphop, there seem to be other genres that might also be implicit duplicates (based on names that look very similar or names that might represent the same genre). Some of these are new/neue/newage, latin/latino and türk/türkçe. These are some examples, but there might be others. Verifying whether these are duplicates might require more manual inspection and domain knowledge. For the purpose of this analysis, we will only consider the variations of hiphop as duplicates.
</div>
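
A rough way to shortlist candidates for manual review is string similarity, for example with `difflib.get_close_matches` from the standard library. The sketch below runs on a handful of labels mentioned in this notebook; the `cutoff=0.8` threshold is an arbitrary assumption, not a tuned value.

```python
from difflib import get_close_matches

# A few genre labels mentioned in this notebook; the real list is much longer.
genres = ['hiphop', 'latin', 'latino', 'new', 'neue', 'newage', 'pop', 'rock']

# For each label, collect the other labels that look similar to it.
candidates = {}
for name in genres:
    others = [g for g in genres if g != name]
    matches = get_close_matches(name, others, n=3, cutoff=0.8)
    if matches:
        candidates[name] = matches

print(candidates)
```

With this cutoff the sketch pairs `latin` with `latino`, but `new`, `neue`, and `newage` fall below the threshold. Similarity scores can only suggest candidates, so domain knowledge is still needed to decide which labels really mean the same genre.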



<div style="padding: 10px;border-left: 7px solid pink">
    ✨
We'll also verify the unique values for the <code>city</code> and <code>day</code> columns, to make sure there are no duplicates.
</div>

In [18]:
# viewing unique cities
unique_cities = music_data['city'].sort_values().unique()
unique_cities_sorted = sorted(unique_cities)

unique_cities_sorted

['Shelbyville', 'Springfield']

In [19]:
# viewing unique days
unique_days = music_data['day'].sort_values().unique()
unique_days_sorted = sorted(unique_days)

unique_days_sorted

['Friday', 'Monday', 'Wednesday']

<div style="padding: 10px;border-left: 7px solid pink">
    ✨
The table only contains the city names and the days of the week that are relevant for our analysis.
</div>

### Conclusions <a id='data_preprocessing_conclusions'></a>
We detected three issues with the data:

- Incorrect header styles
- Missing values
- Obvious and implicit duplicates

The headers have been cleaned up to make processing the table simpler.

All missing values have been replaced with `'unknown'`. But we still have to see whether the missing values in `'genre'` will affect our calculations.

The absence of duplicates will make the results more precise and easier to understand.

Now we can move on to testing hypotheses. 

[⬆ Back to Contents](#back)

## 4. Stage 3: Testing hypotheses <a id='hypotheses'></a>

### 4.1 Hypothesis 1: comparing user behavior in two cities <a id='activity'></a>

According to the first hypothesis, users from Springfield and Shelbyville listen to music differently. Let's test this using the data on the three days of the week: Monday, Wednesday, and Friday.

We will:
* Divide the users into groups by city.
* Compare how many tracks each group played on Monday, Wednesday, and Friday.

**Requirement:** For the sake of practice, perform each computation separately.

In [20]:
# Counting up the tracks played in each city
music_data.groupby('city')['track'].count()

city
Shelbyville    18512
Springfield    42741
Name: track, dtype: int64

Springfield has more tracks played than Shelbyville. But that does not imply that citizens of Springfield listen to music more often. This city is simply bigger, and there are more users.
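
The claim about city size could be checked by relating plays to the number of distinct users. A minimal sketch on made-up rows (the `toy` frame and its values are assumptions for illustration):

```python
import pandas as pd

# Toy data: three Springfield users with few plays, one busy Shelbyville user.
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'D', 'D', 'D'],
    'city': ['Springfield'] * 4 + ['Shelbyville'] * 3,
    'track': ['t1', 't2', 't3', 't4', 't5', 't6', 't7'],
})

# Plays and distinct users per city, then the average plays per user.
per_city = toy.groupby('city').agg(
    plays=('track', 'count'),
    users=('user_id', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```

In this toy example the city with more plays has fewer plays per user, which is why raw totals alone do not show who listens more often.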

Let's group the data by day of the week and find the number of tracks played on Monday, Wednesday, and Friday.

In [21]:
# Calculating tracks played on each of the three days
music_data.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Wednesday is the quietest day overall. But if we consider the two cities separately, we might come to a different conclusion.

Let's write a function that filters the data by both day and city. The function will calculate the number of songs played for a given day and city, using logical indexing with a combined condition.

In [22]:
# creating the function number_tracks

def number_tracks(day='Monday', city='Springfield'):
    '''
    Count the number of songs played for a given day and city.
    
    Parameters: day - string of day of the week.
                city - string of the name of the city.
                
    Returns: count of user ids for the specified day and city.
    '''

    # track_list variable stores the rows for the day and city
    track_list = music_data[(music_data['day'] == day) & (music_data['city'] == city)]
    
    # track_list_count variable stores the number of 'user_id' column values in track_list
    track_list_count = track_list['user_id'].count()
    
    # return the value of track_list_count
    return track_list_count

print(number_tracks())

15740


Now let's use `number_tracks()` for each combination, to retrieve the data on both cities for each of the three days.

In [23]:
# the number of songs played in Springfield on Monday
number_tracks('Monday', 'Springfield')

15740

In [24]:
# the number of songs played in Shelbyville on Monday
number_tracks('Monday', 'Shelbyville')

5614

In [25]:
# the number of songs played in Springfield on Wednesday
number_tracks('Wednesday', 'Springfield')

11056

In [26]:
# the number of songs played in Shelbyville on Wednesday
number_tracks('Wednesday', 'Shelbyville')

7003

In [27]:
# the number of songs played in Springfield on Friday
number_tracks('Friday', 'Springfield')

15945

In [28]:
# the number of songs played in Shelbyville on Friday
number_tracks('Friday', 'Shelbyville')

5895

To better visualize the output, let's create a table, where

* Column names are: `['city', 'monday', 'wednesday', 'friday']`
* The data is the results returned by `number_tracks()`

In [29]:
# table with results
cities = ['Springfield', 'Shelbyville']
days = ['Monday', 'Wednesday', 'Friday']

number_tracks_data = []

for city in cities:
    data = {'city' : city}
    for day in days:
        data[day.lower()] = number_tracks(day,city)
    number_tracks_data.append(data)
    
# print(number_tracks_data)

column_names = ['city', 'monday', 'wednesday', 'friday']
number_tracks_df = pd.DataFrame(data = number_tracks_data, columns = column_names)
number_tracks_df

Unnamed: 0,city,monday,wednesday,friday
0,Springfield,15740,11056,15945
1,Shelbyville,5614,7003,5895
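
The notebook computes each count separately by design. As an alternative sketch, `pd.crosstab` can build a city-by-day table in one call; the `toy` frame below is made up for illustration.

```python
import pandas as pd

# Toy play log with only the two columns needed for the table.
toy = pd.DataFrame({
    'city': ['Springfield', 'Springfield', 'Shelbyville', 'Springfield', 'Shelbyville'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday', 'Wednesday'],
})

# Rows are cities, columns are days, cells are play counts.
counts = pd.crosstab(toy['city'], toy['day'])

# Put the days in calendar order instead of alphabetical order.
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(counts)
```

Combinations that never occur, such as Shelbyville on Friday in this toy data, show up as zeros.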


**Conclusions**

The data reveals differences in user behavior:

- In Springfield, the number of songs played peaks on Mondays and Fridays, while on Wednesday there is a decrease in activity.
- In Shelbyville, on the contrary, users listen to music more on Wednesday. User activity on Monday and Friday is lower.

So the first hypothesis seems to be correct.
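
Because the cities differ in size, the pattern is easier to compare as each day's share of a city's total plays. The sketch below uses the counts from the results table above; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the results table above.
counts = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Springfield', 'Shelbyville'],
)

# Divide each row by its own total so every city sums to 1.
day_share = counts.div(counts.sum(axis=1), axis=0)
print(day_share.round(3))
```

In these shares Wednesday is the weakest day for Springfield and the strongest for Shelbyville, which matches the conclusion drawn from the raw counts.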

[⬆ Back to Contents](#back)

### 4.2 Hypothesis 2: music at the beginning and end of the week <a id='week'></a>

According to the second hypothesis, on Monday morning and Friday night, residents of Springfield listen to different genres than users in Shelbyville. To analyze the data, let's create two tables, one for each city.

In [30]:
# create the spr_general table 
spr_general = music_data[music_data['city'] == 'Springfield'].reset_index(drop=True)
spr_general.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Springfield,14:07:09,Friday
1,E2DC1FAE,Soul People,Space Echo,dance,Springfield,08:34:34,Monday
2,4CB90AA5,True,Roman Messer,dance,Springfield,13:00:07,Wednesday
3,F03E1C1F,Feeling This Way,Polina Griffith,dance,Springfield,20:47:49,Wednesday
4,8FA1D3BE,L’estate,Julia Dalia,ruspop,Springfield,09:17:40,Friday


In [31]:
# create the shel_general table
shel_general = music_data[music_data['city'] == 'Shelbyville'].reset_index(drop=True)
shel_general.head()

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Shelbyville,20:28:33,Wednesday
1,20EC38,Funiculì funiculà,Mario Lanza,pop,Shelbyville,20:58:07,Wednesday
2,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Shelbyville,08:37:09,Monday
3,842029A1,Chains,Obladaet,rusrap,Shelbyville,13:09:41,Friday
4,E772D5C0,Pessimist,unknown,dance,Shelbyville,21:20:49,Wednesday


Let's write a function to return info on the 15 most popular genres on a given day within the period between the two timestamps.

In [32]:

def genre_weekday(df, day, time1, time2):
    '''
    Find the 15 most popular genres on a given day within the specified period of time.
    
    Parameters: df - data frame with the track data.
                day - day of the week.
                time1 - start time, in 'hh:mm' format.
                time2 - end time, in 'hh:mm' format.
    
    Returns: a Series with the top 15 genres played.
    '''

    # store rows where the day is equal to the input day
    genre_df = df[df['day'] == day]
    
    # store only those rows within the specified time
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df = genre_df[genre_df['time'] > time1]
    
    # count the values that meet the requirements
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    
    # sort the result in descending order
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    
    # return the 15 most popular genres on a given day in a given timeframe
    return genre_df_sorted[:15]
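
The time filter works because the `time` column holds zero-padded `'HH:MM:SS'` strings, which sort in chronological order when compared as text. A small pure-Python sketch of how the strict comparisons treat the window edges (the sample times are made up):

```python
# Sample timestamps in the same 'HH:MM:SS' format as the 'time' column.
times = ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00']

# Same strict comparisons as in genre_weekday(), with 'hh:mm' bounds.
time1, time2 = '07:00', '11:00'
in_window = [t for t in times if time1 < t < time2]
print(in_window)
```

Since `'07:00'` is a prefix of `'07:00:00'`, it compares as smaller, so plays at exactly 07:00:00 are kept while plays at 11:00:00 are dropped. With `'hh:mm'` bounds the window therefore includes the start time and excludes the end time.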

Let's compare the results of the `genre_weekday()` function for Springfield and Shelbyville on Monday morning (from 07:00 to 11:00) and on Friday (from 13:00 to 23:00, a window that covers the afternoon as well as the evening):

In [33]:
# calling the function for Monday morning in Springfield
genre_weekday(spr_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
classical      157
metal          120
jazz           100
folk            97
soundtrack      95
Name: genre, dtype: int64

In [34]:
# calling the function for Monday morning in Shelbyville
genre_weekday(shel_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
world           36
rap             32
soundtrack      31
rnb             27
metal           27
Name: genre, dtype: int64

In [35]:
# calling the function for Friday evening in Springfield
genre_weekday(spr_general, 'Friday', '13:00', '23:00')

genre
pop            1377
dance          1017
rock            988
electronic      924
hiphop          537
classical       402
world           363
alternative     352
ruspop          331
rusrap          293
jazz            261
soundtrack      197
metal           184
unknown         178
rnb             172
Name: genre, dtype: int64

In [36]:
# calling the function for Friday evening in Shelbyville
genre_weekday(shel_general, 'Friday', '13:00', '23:00')

genre
pop            537
dance          447
rock           437
electronic     389
hiphop         220
classical      166
alternative    146
ruspop         123
rusrap         118
world          111
jazz           106
soundtrack      78
metal           73
folk            73
unknown         72
Name: genre, dtype: int64

**Conclusion**

Having compared the top 15 genres on Monday morning, we can draw the following conclusions:

1. Users from Springfield and Shelbyville listen to similar music. The top five genres are the same, only rock and electronic have switched places.

2. In Springfield, the number of missing values turned out to be so big that the value `'unknown'` came in 10th. This means that missing values make up a considerable portion of the data, which may be a basis for questioning the reliability of our conclusions.

For Friday evening, the situation is similar. Individual genres vary somewhat, but on the whole, the top 15 is similar for the two cities.

Thus, the data does not support the second hypothesis:
* Users listen to similar music at the beginning and end of the week.
* There is no major difference between Springfield and Shelbyville. In both cities, pop is the most popular genre.

However, the number of missing values makes this result questionable. In Springfield, there are so many that they affect our top 15. Were we not missing these values, things might look different.

[⬆ Back to Contents](#back)

### 4.3 Hypothesis 3: genre preferences in Springfield and Shelbyville <a id='genre'></a>

The last hypothesis states that Shelbyville loves rap music and Springfield's citizens are more into pop. To check that, let's find the number of songs played for each genre in each city.

In [37]:
# for Springfield
spr_genres = spr_general.groupby('genre')['genre'].count().sort_values(ascending=False)
spr_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [38]:
# for Shelbyville
shel_genres = shel_general.groupby('genre')['genre'].count().sort_values(ascending=False)

shel_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

### Conclusions <a id='hypotheses_testing_conclusions'></a>

The hypothesis has been partially proven true:
* Pop music is the most popular genre in Springfield, as expected.
* However, pop is also the most popular genre in Shelbyville, and rap did not make the top 5 in either city.
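
Raw counts are hard to compare between cities of different size, so it can help to express a genre as a share of all plays in its city. The sketch below copies the city totals and a few genre counts from the outputs above; only the variable names are new.

```python
# Total plays per city and selected genre counts, copied from the outputs above.
totals = {'Springfield': 42741, 'Shelbyville': 18512}
genre_counts = {
    'Springfield': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Shelbyville': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
}

# Share of each genre within its own city.
shares = {
    city: {genre: count / totals[city] for genre, count in counts.items()}
    for city, counts in genre_counts.items()
}

for city, by_genre in shares.items():
    print(city, {genre: f'{share:.1%}' for genre, share in by_genre.items()})
```

Pop accounts for roughly 13 to 14 percent of plays in both cities, and the `rusrap` share is only slightly higher in Shelbyville, which supports the reading that the two cities have similar tastes.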


[⬆ Back to Contents](#back)

## 5. Findings <a id='end'></a>

We have tested the following three hypotheses:

1. User activity differs depending on the day of the week and from city to city. 
2. On Monday mornings, Springfield and Shelbyville residents listen to different genres. This is also true for Friday evenings. 
3. Springfield and Shelbyville listeners have different preferences. In Springfield, they prefer pop, while Shelbyville has more rap fans.

After analyzing the data, we concluded:

1. User activity in Springfield and Shelbyville depends on the day of the week, but the pattern differs between the cities: Springfield peaks on Monday and Friday, while Shelbyville peaks on Wednesday.

The first hypothesis is fully accepted.

2. Musical preferences do not vary significantly over the course of the week in either Springfield or Shelbyville. There are small differences in the ranking on Monday mornings, but pop is the most popular genre in both cities.

So we can't accept this hypothesis. We must also keep in mind that the result could have been different if not for the missing values.

3. The musical preferences of users from Springfield and Shelbyville turned out to be quite similar.

The third hypothesis is rejected. If there is any difference in preferences, it cannot be seen from this data.

[⬆ Back to Contents](#back)

<div style="background-color: oldlace; padding: 10px;">
<b>Note:</b><br> 
    This project is part of the Data Analyst Bootcamp at Tripleten - Sprint 5: Python Fundamentals.<br><br>
    It was noted that in real projects, research involves statistical hypothesis testing, which is more precise and more quantitative.<br>
    It was also noted that conclusions about an entire city cannot always be drawn from the data of just one source.<br><br>
    The structure and content were given as part of the instructions.<br>
    Any part of the preprocessing or analysis beyond the requirements, as well as additional comments, is marked with a pink left border and the ✨ icon.
    
 </div>