# Yandex Music

Yandex Music is a music streaming service similar to Apple Music or Spotify.
The goal of this study is to compare the musical preferences of listeners in Moscow and St. Petersburg.

We are going to test three hypotheses:

1. User activity depends on the day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday morning, different genres prevail in Moscow and in St. Petersburg. The same holds on Friday evening.
3. Moscow and St. Petersburg prefer different music genres: pop in Moscow, Russian rap in St. Petersburg.

## Data Overview

In [2]:
# import library
import pandas as pd

In [3]:
# reading from the file
df = pd.read_csv(r'C:\Users\pinos\Desktop\yandex_music_project.csv')

In [4]:
display(df.head(10))

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [5]:
# info about the table
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 1.7+ MB


In [6]:
# cleaning up the column names (stray spaces, inconsistent capitalization)
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})
display(df.head())

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday


In [8]:
# detecting missing values
display(df.isna().sum())

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

In [9]:
# replacing missing values with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')
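
The same replacement can be written without an explicit loop, since `fillna` accepts a dictionary that maps column names to fill values. The sketch below runs on a small made-up frame (the `toy` name and its rows are illustrative, not taken from the dataset) and only mirrors the column names used above.

```python
import pandas as pd

# Toy frame standing in for the real data; the rows are invented for illustration.
toy = pd.DataFrame({
    'track': ['Soul People', None, 'True'],
    'artist': [None, None, 'Roman Messer'],
    'genre': ['dance', 'ruspop', None],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg'],
})

columns_to_replace = ['track', 'artist', 'genre']

# A dict of {column: fill value} fills only the listed columns in one call.
toy = toy.fillna({column: 'unknown' for column in columns_to_replace})

# Every column should now report zero missing values.
print(toy.isna().sum())
```

Filling with a placeholder keeps every play in the counts, at the cost of an artificial 'unknown' category that can show up in genre rankings.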

In [10]:
# just checking
display(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

In [11]:
# counting fully duplicated rows
print(df.duplicated().sum())

3826


In [12]:
# deleting duplicate rows
df = df.drop_duplicates()

In [13]:
# just checking
print(df.duplicated().sum())

0


In [14]:
# unique values in the genre column
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [15]:
# the genre list contains several spellings of the same genre: 'hip', 'hop', 'hip-hop'.
# we replace all of them with 'hiphop' using a helper function:
def replace_wrong_values(wrong_values, correct_value): 
    for wrong_value in wrong_values: 
        df['genre'] = df['genre'].replace(wrong_value, correct_value) 

duplicates = ['hip', 'hop', 'hip-hop'] 
name = 'hiphop' 
replace_wrong_values(duplicates, name) 
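
As a possible alternative, `Series.replace` accepts a list of values, so the loop can collapse into a single call. The sketch below also takes the column as an argument and returns a new Series instead of modifying the global `df`. The function name `replace_wrong_values_in` and the sample values are assumptions made for illustration.

```python
import pandas as pd

def replace_wrong_values_in(series, wrong_values, correct_value):
    """Return a copy of `series` where each value in `wrong_values` becomes `correct_value`."""
    # With a list as the first argument, replace() matches whole values, not substrings.
    return series.replace(wrong_values, correct_value)

# Made-up sample of genre labels.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'pop'])
cleaned = replace_wrong_values_in(genres, ['hip', 'hop', 'hip-hop'], 'hiphop')

print(cleaned.unique())
```

In the notebook this would be used as `df['genre'] = replace_wrong_values_in(df['genre'], duplicates, name)`.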

In [16]:
# checking that the replacement worked
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

## Hypothesis testing

### User behavior comparison between cities

The first hypothesis states that user activity depends on the day of the week and that the pattern differs between Moscow and St. Petersburg. We will check this assumption using data for three days of the week: Monday, Wednesday, and Friday.

In [17]:
# counting plays in each city
df.groupby('city')['genre'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

In [18]:
# counting plays for each day
df.groupby('day')['genre'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

In [19]:
# a function that counts the plays for a given day and city
def number_tracks(day, city):
    track_list = df[(df['day'] == day)  &  (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count

In [20]:
# calling the function
# number of plays in Moscow on Monday
number_tracks('Monday', 'Moscow')

15740

In [21]:
# number of plays in St. Petersburg on Monday
number_tracks('Monday', 'Saint-Petersburg')

5614

In [22]:
# number of plays in Moscow on Wednesday
number_tracks('Wednesday', 'Moscow')

11056

In [23]:
# number of plays in St. Petersburg on Wednesday
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [24]:
# number of plays in Moscow on Friday
number_tracks('Friday', 'Moscow') 

15945

In [25]:
# number of plays in St. Petersburg on Friday
number_tracks('Friday', 'Saint-Petersburg')

5895

In [27]:
# summary table: plays by city and day of the week
info = pd.DataFrame(
    data=[['Moscow', 15740, 11056, 15945],
          ['St. Petersburg', 5614, 7003, 5895]],
    columns=['city', 'monday', 'wednesday', 'friday'],
)
info

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,St. Petersburg,5614,7003,5895


The data shows a difference in user behavior. In Moscow, listening peaks on Monday and Friday, with a noticeable dip on Wednesday. In St. Petersburg, on the contrary, users listen most on Wednesday, and activity on Monday and Friday is lower by a similar margin. So the data supports the first hypothesis.
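
The city-by-day counts can also be built in one step with `pd.crosstab`, which avoids typing the numbers into a table by hand. This is a minimal sketch on made-up rows. The `toy`, `day_order`, and `plays` names are illustrative, and only the `city` and `day` column names come from the notebook.

```python
import pandas as pd

# Made-up plays; in the notebook the cleaned `df` would be used instead.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# crosstab counts rows for every (city, day) pair.
# reindex puts the days in calendar order rather than alphabetical order.
day_order = ['Monday', 'Wednesday', 'Friday']
plays = pd.crosstab(toy['city'], toy['day']).reindex(columns=day_order, fill_value=0)

print(plays)
```

Applied to the real data, `pd.crosstab(df['city'], df['day'])` should reproduce the six values returned by `number_tracks`.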

According to the second hypothesis, on Monday morning some genres prevail in Moscow and others in St. Petersburg. Similarly, on Friday evening, different genres prevail, depending on the city. Now, we are going to test that.

In [28]:
# one table per city
moscow_general = df[df['city'] == 'Moscow']
spb_general = df[df['city'] == 'Saint-Petersburg']

In [29]:
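# a function that returns the 10 most played genres for a given day and time window
# (times are zero-padded 'HH:MM:SS' strings, so string comparison orders them correctly)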
def genre_weekday(df, day, time1, time2):
    genre_df = df[(df['day'] == day) & (df['time'] > time1) & (df['time'] < time2)]
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    return genre_df_sorted[:10]
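
To see how the time filter behaves, the sketch below applies the same comparison to a few made-up rows. It relies on the times being zero-padded `HH:MM:SS` strings, so that alphabetical order matches chronological order. It also uses `value_counts`, which counts and sorts in descending order in one call, as a swapped-in shortcut for the `groupby` plus `sort_values` steps. The `toy` rows are invented for illustration.

```python
import pandas as pd

# Made-up rows; only the column names mirror the notebook.
toy = pd.DataFrame({
    'genre': ['pop', 'rock', 'pop', 'dance'],
    'day': ['Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:34:34', '06:59:59', '10:15:00', '08:00:00'],
})

# Keep Monday plays strictly between 07:00 and 11:00.
# '06:59:59' sorts before '07:00' as a string, so that row is excluded.
in_window = toy[(toy['day'] == 'Monday')
                & (toy['time'] > '07:00')
                & (toy['time'] < '11:00')]

# value_counts() counts each genre and sorts from most to least frequent.
top = in_window['genre'].value_counts().head(10)
print(top)
```

The string comparison would stop working if times were stored without leading zeros (for example `8:34:34`), in which case converting the column to a time type first would be safer.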

In [30]:
# calling the function for Monday morning in Moscow
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [31]:
# calling the function for Monday morning in St. Petersburg
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [32]:
# the same for Moscow on Friday evening
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [33]:
# the same for St. Petersburg on Friday evening
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

Moscow and St. Petersburg listen to similar music. The only difference is that the “world” genre made the Moscow top 10, while jazz and classical made the St. Petersburg top 10.

In Moscow, there were so many missing values that 'unknown' took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.

Friday evening doesn't change the picture. Some genres move up a little and others move down, but overall the top 10 stays largely the same.
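
The comparison of the two Monday-morning rankings can be made explicit with set operations. The lists below are copied from the outputs of `genre_weekday` above. The `compare_top` helper is a hypothetical name introduced for this sketch.

```python
def compare_top(first, second):
    """Return (only in first, only in second, in both), each sorted alphabetically."""
    first_set, second_set = set(first), set(second)
    return (sorted(first_set - second_set),
            sorted(second_set - first_set),
            sorted(first_set & second_set))

# Monday-morning top 10 genres, copied from the outputs above.
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

only_moscow, only_spb, shared = compare_top(moscow_monday, spb_monday)
print('Moscow only:', only_moscow)
print('St. Petersburg only:', only_spb)
print('Shared genres:', len(shared))
```

In the notebook, the lists could be taken directly from the results, for example `genre_weekday(moscow_general, 'Monday', '07:00', '11:00').index`.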

### Preferences by genre in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, so this genre is listened to there more often than in Moscow. Moscow, in turn, is a city of contrasts in which pop music nevertheless prevails.

In [36]:
# grouping and sorting genres in Moscow
moscow_grouping = moscow_general.groupby('genre')['genre'].count()
moscow_genres = moscow_grouping.sort_values(ascending=False)

In [35]:
display(moscow_genres.head(10))

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [37]:
# the same for St. Petersburg
spb_grouping = spb_general.groupby('genre')['genre'].count()
spb_genres = spb_grouping.sort_values(ascending=False)

In [38]:
display(spb_genres.head(10))

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

The hypothesis was partially confirmed:

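Pop music is the most popular genre in Moscow, as the hypothesis suggested (it also leads in St. Petersburg). Moreover, a closely related genre, Russian pop (ruspop), is also in the Moscow top 10.

Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg: hiphop ranks fifth in both cities, while rusrap ranks tenth in Moscow and eighth in St. Petersburg.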
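
Because Moscow has more than twice as many plays as St. Petersburg, raw counts are hard to compare directly. One way to check the claim is to convert counts into shares of each city's total. The sketch below uses the pop and rusrap counts and the per-city totals printed earlier in the notebook. The `counts`, `totals`, and `shares` names are illustrative.

```python
import pandas as pd

# Genre counts per city, copied from the top-10 outputs above.
counts = pd.DataFrame(
    {'Moscow': [5892, 1161], 'Saint-Petersburg': [2431, 564]},
    index=['pop', 'rusrap'],
)

# Total plays per city after removing duplicates, from the groupby('city') output.
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Divide each city's column by that city's total and express it in percent.
shares = counts.div(totals, axis='columns') * 100
print(shares.round(1))
```

On the full data, `df.groupby('city')['genre'].value_counts(normalize=True)` would give the same kind of shares for every genre at once.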

### Conclusions

The day of the week affects user activity differently in Moscow and St. Petersburg.
The first hypothesis was fully confirmed.

Musical preferences don't change much during the week, whether in Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Monday morning:
in Moscow, the “world” genre makes the top 10,
in St. Petersburg, jazz and classical do.
Thus, the second hypothesis was only partially confirmed. This result could have been different if not for the missing values in the data.

The tastes of users in Moscow and St. Petersburg have more in common than they have differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.
The third hypothesis was not confirmed. If there are differences in preferences, they are not visible among the bulk of users.