# Music Streaming

**Analysis goal**: test three hypotheses:
1. User activity differs by day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday morning certain music genres dominate in Moscow while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres depending on the city.
3. Moscow and St. Petersburg prefer different music genres. In Moscow people mostly listen to pop music; in St. Petersburg, to Russian rap.

**Analysis structure**:
* Data overview
* Data preparation
* Hypothesis check

## Data overview

In [86]:
import pandas as pd

df = pd.read_csv('datasets/yandex_music_project.csv')
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [87]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table has 7 columns, all of dtype `object`.

The non-null counts differ between columns, so the data contain missing values.

There seems to be enough data to test the hypotheses. However, there are gaps in the data, and the column names need fixing: some have stray spaces and the capitalization is inconsistent.

## Data preparation

### Column names

In [88]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [89]:
df = df.rename(columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day',
})
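As an alternative to listing every column by hand, the names can be normalized programmatically. The following is a minimal sketch on an empty toy frame (`raw` is a hypothetical name) that reuses the header from `df.columns` above; it is not the notebook's own method.

```python
import pandas as pd

# Toy frame with the same raw header as df.columns (no rows needed)
raw = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# Strip surrounding spaces and lowercase every name
raw.columns = raw.columns.str.strip().str.lower()

# 'userid' still lacks the underscore, so it gets an explicit rename
raw = raw.rename(columns={'userid': 'user_id'})

print(list(raw.columns))
# ['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']
```

The explicit `rename` is still needed for names that require more than trimming and lowercasing.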

### Missing values in data

In [90]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Missing values in the columns `track`, `artist`, and `genre` are replaced with `'unknown'`.

In [91]:
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
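The same replacement can also be written without a loop, because `fillna` works on a subset of columns. A small sketch on a toy frame (the rows are made up for illustration):

```python
import pandas as pd

# Toy frame with gaps in the same three columns (values are made up)
toy = pd.DataFrame({
    'track': ['Soul People', None],
    'artist': [None, 'Space Echo'],
    'genre': ['dance', None],
    'city': ['Moscow', 'Moscow'],
})

columns_to_replace = ['track', 'artist', 'genre']

# fillna on a column subset fills all three columns in one assignment
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

print(toy.isna().sum().sum())  # total number of remaining missing values
```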

In [92]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

In [93]:
df.duplicated().sum()

3826

In [94]:
df = df.drop_duplicates().reset_index(drop=True)

Next, we check the `genre` column for implicit duplicates, that is, the same genre spelled in slightly different ways. Such errors would distort the results of the study.

In [95]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

The list contains implicit duplicates of `hiphop`:
* *hip*,
* *hop*,
* *hip-hop*.

To get rid of them we create a function `replace_wrong_genres()` with two parameters:
* `wrong_genres`,
* `correct_genre`.

The function replaces every genre name from the list `wrong_genres` with the value of `correct_genre`.

In [96]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # Updates the 'genre' column of the global df: each wrong label becomes correct_genre
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [97]:
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
replace_wrong_genres(wrong_genres, correct_genre)
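The function above depends on the global `df`. A variant that takes the frame as an argument and returns a new one is easier to reuse and test. The sketch below is one possible shape, with `replace_genres` and `sample` as hypothetical names; it relies on `Series.replace` accepting a list of values to replace.

```python
import pandas as pd

def replace_genres(frame, wrong_genres, correct_genre):
    """Return a copy of frame with every wrong genre label replaced."""
    result = frame.copy()
    # A list as the first argument replaces all listed labels in one call
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result

sample = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})
fixed = replace_genres(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')

print(fixed['genre'].unique())
```

The input frame is left unchanged, so the cell can be rerun safely.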

Check that the wrong labels are gone from `genre`:

In [98]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

## Hypothesis check

### Comparison of user behavior in two cities

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg.  
We can check this assumption on three days of the week: Monday, Wednesday and Friday.

In [99]:
df_msc = df[df['city'] == 'Moscow']
df_spb = df[df['city'] == 'Saint-Petersburg']

display(df_msc['track'].count(), df_spb['track'].count())

42741

18512

The count for Moscow is much higher. This does not imply that users in Moscow listen to music more often; there are simply more users in Moscow, whose population is more than twice that of St. Petersburg.
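Whether a higher play count reflects more users or more listening per user can be checked by dividing plays by the number of distinct users. The sketch below uses a made-up toy frame, so the numbers say nothing about the real dataset; with the notebook's data the same `groupby` could be applied to `df`.

```python
import pandas as pd

# Made-up plays: three Moscow users with four plays, one St. Petersburg user with two
toy = pd.DataFrame({
    'user_id': ['A', 'A', 'B', 'C', 'D', 'D'],
    'track': ['t1', 't2', 't3', 't4', 't5', 't6'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
})

# Plays and distinct users per city, then the ratio of the two
per_city = toy.groupby('city').agg(
    plays=('track', 'count'),
    users=('user_id', 'nunique'),
)
per_city['plays_per_user'] = per_city['plays'] / per_city['users']

print(per_city)
```

In this toy example Moscow has more plays in total but fewer plays per user, which is exactly the distinction the raw counts cannot show.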


In [100]:
df.groupby('day')['track'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: track, dtype: int64

Taken together, users in the two cities are less active on Wednesdays. But the picture may change if we consider each city separately.

We create `number_tracks()` function that counts the plays for a given day and city. It takes two parameters:
* day of the week,
* city name.

In [101]:
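def number_tracks(day, city):
    # Keep only the rows for the requested day, then for the requested city
    df_count = df[df['day'] == day]
    df_count = df_count[df_count['city'] == city]
    # user_id has no missing values, so count() equals the number of plays
    track_list_count = df_count['user_id'].count()
    return track_list_count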

In [102]:
number_tracks('Monday', 'Moscow')

15740

In [103]:
number_tracks('Monday', 'Saint-Petersburg')

5614

In [104]:
number_tracks('Wednesday', 'Moscow')

11056

In [105]:
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [106]:
number_tracks('Friday', 'Moscow')

15945

In [107]:
number_tracks('Friday', 'Saint-Petersburg')

5895

In [108]:
columns = ['city', 'Monday', 'Wednesday', 'Friday']
# A list of rows keeps the row order fixed; a set is unordered
data = [
    ['Moscow', 15740, 11056, 15945],
    ['Saint-Petersburg', 5614, 7003, 5895],
]

pd.DataFrame(data=data, columns=columns)

Unnamed: 0,city,Monday,Wednesday,Friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
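The same city-by-day table can be built directly from the data instead of copying six numbers by hand. The sketch below uses `pd.crosstab` on a made-up toy frame; with the notebook's data the call would take `df['city']` and `df['day']`.

```python
import pandas as pd

# Made-up plays, only to show the shape of the result
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

day_order = ['Monday', 'Wednesday', 'Friday']

# crosstab counts rows per (city, day); reindex puts the days in calendar order
counts = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)

print(counts)
```

Without the `reindex` step the day columns would appear in alphabetical order.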


**Findings**

The data shows the difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable dip on Wednesday.
- In St. Petersburg, on the contrary, listening peaks on Wednesday, while Monday and Friday are lower by a similar margin.

So the data support the first hypothesis.
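Because the two cities differ so much in total plays, the pattern is easier to compare as a share of each city's own total. The following sketch recomputes the counts from the table above as percentages; the variable names are illustrative.

```python
# Play counts copied from the city-by-day table
plays = {
    'Moscow': {'Monday': 15740, 'Wednesday': 11056, 'Friday': 15945},
    'Saint-Petersburg': {'Monday': 5614, 'Wednesday': 7003, 'Friday': 5895},
}

# Share of each day within its own city, in percent
shares = {}
for city, by_day in plays.items():
    total = sum(by_day.values())
    shares[city] = {day: round(100 * n / total, 1) for day, n in by_day.items()}

for city, by_day in shares.items():
    print(city, by_day)
# Moscow {'Monday': 36.8, 'Wednesday': 25.9, 'Friday': 37.3}
# Saint-Petersburg {'Monday': 30.3, 'Wednesday': 37.8, 'Friday': 31.8}
```

Wednesday accounts for roughly a quarter of the Moscow plays but well over a third of the St. Petersburg plays.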

### Music at the beginning and end of the week

According to the second hypothesis, on Monday mornings certain music genres dominate in Moscow while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres depending on the city.

In [109]:
moscow_general = df_msc.copy()

In [110]:
spb_general = df_spb.copy()

We create a function `genre_weekday()` with 4 parameters:
* dataframe,
* day of the week,
* beginning time in 'hh:mm' format, 
* ending time in 'hh:mm' format.

The function returns the top 10 genres of the tracks that were listened to on the specified day, in the interval between two timestamps.

In [111]:
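def genre_weekday(table, day, time1, time2):
    # Filter by day, then keep plays strictly between time1 and time2
    # (times are zero-padded strings, so string comparison follows clock order)
    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]

    # Count plays per genre and return the 10 most frequent
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)

    return genre_df_sorted.head(10)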
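`genre_weekday()` compares the `time` column as plain strings. This works because the values are zero-padded `hh:mm:ss` strings, so alphabetical order matches clock order. The sketch below shows the behaviour at the interval boundaries on a few sample strings; the unpadded bound at the end is an assumed counter-example, not something present in the data.

```python
# Zero-padded 'hh:mm:ss' strings, as stored in the time column
times = ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00']

# Same strict comparisons that genre_weekday() applies
selected = [t for t in times if '07:00' < t < '11:00']
print(selected)
# ['07:00:00', '08:34:34', '10:59:59']

# Assumed counter-example: an unpadded bound no longer follows clock order
print('08:34:34' > '7:00')
# False
```

Note that `'07:00:00'` is included because a longer string with the same prefix sorts after `'07:00'`, while `'11:00:00'` is excluded for the same reason.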

We compare the results of `genre_weekday()` for Moscow and St. Petersburg on Monday morning (7:00 - 11:00) and Friday evening (17:00 - 23:00):

In [112]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [113]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [114]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [115]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Findings**

For Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg people listen to similar music. The only difference is that the Moscow top 10 includes the “world” genre, while the St. Petersburg top 10 includes jazz and classical.

2. Moscow has so many missing genre values that `'unknown'` ranks tenth among the most popular genres. Missing values therefore make up a noticeable share of the data and threaten the reliability of the study.

Friday evening does not change this picture. Some genres rise a little higher, others go down, but overall the top 10 stays largely the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, people listen to Russian popular music more often; in St. Petersburg, to jazz.

However, gaps in the data cast doubt on the result.
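The overlap between the top-10 lists can be made explicit with set operations. The sketch below copies the genre names from the four outputs above; the helper `compare` is a hypothetical name.

```python
# Top-10 genres copied from the genre_weekday() outputs
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']
moscow_friday = ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap']
spb_friday = ['pop', 'electronic', 'rock', 'dance', 'hiphop',
              'alternative', 'jazz', 'classical', 'rusrap', 'world']

def compare(first, second):
    """Return (number of shared genres, only in first, only in second)."""
    a, b = set(first), set(second)
    return len(a & b), sorted(a - b), sorted(b - a)

print(compare(moscow_monday, spb_monday))
# (8, ['unknown', 'world'], ['classical', 'jazz'])
print(compare(moscow_friday, spb_friday))
# (9, ['ruspop'], ['jazz'])
```

Eight of ten genres are shared on Monday morning and nine of ten on Friday evening.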

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, where this genre is listened to more often than in Moscow. Moscow is a city of contrasts that is nevertheless dominated by pop music.

In [116]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [117]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [118]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [119]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Findings**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also contains a closely related genre, Russian popular music.
* Contrary to expectations, St. Petersburg is not the capital of rap: Russian rap ranks only eighth there. Its share of plays is still slightly higher than in Moscow (about 3.0% versus 2.7%), where it ranks tenth.
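Absolute counts are hard to compare between cities of different size, so it helps to look at rank and share together. The sketch below uses the top-10 counts from the two outputs above and the per-city totals from the earlier `count()` output; `rank_and_share` is a hypothetical helper.

```python
# Top-10 play counts copied from the outputs above, already sorted descending
moscow_top = {'pop': 5892, 'dance': 4435, 'rock': 3965, 'electronic': 3786,
              'hiphop': 2096, 'classical': 1616, 'world': 1432,
              'alternative': 1379, 'ruspop': 1372, 'rusrap': 1161}
spb_top = {'pop': 2431, 'dance': 1932, 'rock': 1879, 'electronic': 1736,
           'hiphop': 960, 'alternative': 649, 'classical': 646,
           'rusrap': 564, 'ruspop': 538, 'world': 515}

# Total plays per city after cleaning
MOSCOW_TOTAL = 42741
SPB_TOTAL = 18512

def rank_and_share(top, total, genre):
    """Return (rank in the top 10, share of all plays in percent)."""
    rank = list(top).index(genre) + 1  # dicts preserve insertion order
    share = round(100 * top[genre] / total, 1)
    return rank, share

print(rank_and_share(moscow_top, MOSCOW_TOTAL, 'rusrap'))  # (10, 2.7)
print(rank_and_share(spb_top, SPB_TOTAL, 'rusrap'))        # (8, 3.0)
print(rank_and_share(moscow_top, MOSCOW_TOTAL, 'pop'))     # (1, 13.8)
print(rank_and_share(spb_top, SPB_TOTAL, 'pop'))           # (1, 13.1)
```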


## Research results

**In practice, studies include statistical hypothesis tests.**
Data from a single service do not always support conclusions about all residents of a city.
Statistical hypothesis tests show how reliable such conclusions are, given the available data.
Methods for testing hypotheses are covered in later topics.
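As an illustration only, the sketch below computes a Pearson chi-square statistic for the city-by-day table from the first hypothesis. It is not part of the original analysis, and it assumes that plays are independent observations, which is doubtful here because one user contributes many plays; the result should therefore be read as optimistic.

```python
# Observed plays per city, in the order Monday, Wednesday, Friday
observed = {
    'Moscow': [15740, 11056, 15945],
    'Saint-Petersburg': [5614, 7003, 5895],
}

row_totals = {city: sum(counts) for city, counts in observed.items()}
col_totals = [sum(col) for col in zip(*observed.values())]
grand_total = sum(row_totals.values())

# Pearson chi-square statistic: sum of (observed - expected)^2 / expected
chi2 = 0.0
for city, counts in observed.items():
    for j, obs in enumerate(counts):
        expected = row_totals[city] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

dof = (len(observed) - 1) * (len(col_totals) - 1)
CRITICAL_VALUE = 5.991  # chi-square critical value for 2 degrees of freedom, alpha = 0.05

print(f'chi2 = {chi2:.1f}, dof = {dof}, exceeds critical value: {chi2 > CRITICAL_VALUE}')
```

A statistic above the critical value would indicate that the distribution of plays over days differs between the cities, subject to the independence caveat above.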


We tested three hypotheses and found:

1. The day of the week has a different effect on the activity of users in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week in either Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week:
* in Moscow they listen to music of the “world” genre,
* in St. Petersburg, to jazz and classical music.

Thus, the second hypothesis was only partly confirmed. This result could have been different if there were no gaps in the data.

3. Users' tastes in Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed.