# Yandex Music

The comparison between Moscow and St. Petersburg is surrounded by myths. For example:

* Moscow is a metropolis, subject to the rigid rhythm of the working week; 
* St. Petersburg is the cultural capital, with its own tastes.

Using Yandex Music data, you will compare the behavior of users in the two capitals.

**The purpose of the study** is to test three hypotheses:
1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg this manifests itself in different ways.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Likewise, on Friday evening, different genres predominate depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow people listen to pop music more often; in St. Petersburg, to Russian rap.

**Course of the study**

You will receive data on user behavior from the `yandex_music_project.csv` file. Nothing is known about the quality of the data. Therefore, a review of the data will be needed before testing hypotheses.

You will check the data for errors and evaluate their impact on the study. Then, during the preprocessing phase, you will look for ways to correct the most critical data errors.
 
Thus, the research will take place in three stages:
 1. Review of data.
 2. Data preprocessing.
 3. Testing hypotheses.

## Data review

In [113]:
import pandas as pd

In [114]:
df = pd.read_csv('/datasets/yandex_music_project.csv')

In [115]:
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [116]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


So, the table has seven columns. The data type in all columns is `object`.

According to data documentation:

* `userID` — user ID;
* `Track` — name of the track;  
* `artist` — name of the performer;
* `genre` — name of the genre;
* `City` — city of the user;
* `time` — start time of listening;
* `Day` — day of the week.

The number of non-null values differs between columns. This means that the data contains missing values.

**Conclusions**

Each row of the table holds data on one track that was played. Some of the columns describe the track itself: its name, artist, and genre. The rest describe the user: which city they are from and when they listened to the music. The data contains missing values.

## Data preprocessing

### Style

In [121]:
df = df.rename(
    columns={
    '  userID': 'user_id',
    'Track': 'track',
    '  City  ': 'city',
    'Day': 'day'
    }
) # renaming columns

In [122]:
df.columns # check

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
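The renaming above lists each offending header by hand. As a rough sketch of a more general approach, the helper below (the name `to_snake_case` is hypothetical, not part of the project) strips surrounding spaces and converts camelCase to snake_case. It is shown on an empty toy frame whose headers mimic the ones reported by `df.info()`.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding whitespace and convert camelCase to snake_case."""
    stripped = name.strip()
    # Insert an underscore at every lowercase-to-uppercase boundary.
    separated = re.sub(r'(?<=[a-z])(?=[A-Z])', '_', stripped)
    return separated.lower()


# Toy frame with the same header problems: stray spaces, capitals, camelCase.
sample = pd.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)
sample = sample.rename(columns=to_snake_case)
print(list(sample.columns))
```

An explicit mapping, as used in the notebook, is easier to review for a seven-column table; a helper like this mostly pays off when there are many columns.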

### Missing values

In [123]:
df.isna().sum() # counting NaNs

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the study. In `track` and `artist`, the gaps are not important for the analysis. It is enough to replace them with explicit placeholders.

But missing values in the `genre` column can interfere with the comparison of musical tastes in Moscow and St. Petersburg.
In practice, it would be necessary to establish the cause of the gaps and restore the data.
In an educational project, however, there is no such opportunity.

Hence, we have to:

* fill in these gaps with explicit placeholders;
* evaluate how much they damage the calculations.
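One way to approach the second point is to measure the share of missing genres per city before filling them in. The snippet below is a minimal sketch on a hypothetical miniature table (the values are illustrative, not taken from the dataset); it assumes the columns are already named `city` and `genre`.

```python
import pandas as pd

# Hypothetical miniature of the table; values are illustrative only.
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', None, 'rock', None, 'jazz', 'pop', None, 'rock'],
})

# isna() yields booleans; their mean within each city is the share of gaps.
missing_share = sample['genre'].isna().groupby(sample['city']).mean()
print(missing_share)
```

If the share differs noticeably between the cities, comparisons of genre rankings should be read with that imbalance in mind.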

In [183]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [184]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

In [126]:
df.duplicated().sum() # explicit duplicates

3826

In [131]:
df = df.drop_duplicates() # duplicates deletion

In [185]:
df.duplicated().sum() # check

0

Now we get rid of implicit duplicates in the `genre` column. For example, the name of the same genre can be spelled in slightly different ways. Such errors will also affect the results of the study.

In [215]:
df['genre'].sort_values().unique() # unique genres

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [216]:
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop') # unify implicit duplicates in the genre column only
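Calling `replace` on a whole DataFrame rewrites matching values in every column, so a track that happens to be titled `hop` would be renamed as well. The sketch below wraps the replacement in a hypothetical helper, `replace_wrong_genres`, that touches only the `genre` column; the sample rows are made up for illustration.

```python
import pandas as pd


def replace_wrong_genres(frame, wrong_genres, correct_genre):
    """Return a copy of frame with variant spellings in 'genre' unified."""
    result = frame.copy()
    # Series.replace only affects the genre column, not track or artist.
    result['genre'] = result['genre'].replace(wrong_genres, correct_genre)
    return result


# Illustrative rows: the first track title collides with a genre variant.
sample = pd.DataFrame({
    'track': ['hop', 'Song A', 'Song B', 'Song C'],
    'genre': ['hip', 'hop', 'hip-hop', 'rock'],
})
cleaned = replace_wrong_genres(sample, ['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned)
```

Returning a copy keeps the original frame intact, which makes it easy to compare the data before and after the cleanup.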

In [218]:
df['genre'].sort_values().unique() # check

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Conclusion**

Preprocessing revealed three problems in the data:

- style violations in the column headers,
- missing values,
- duplicates, both explicit and implicit.

We corrected the headers to make the table easier to work with.
Without duplicates, the study will be more accurate.

We replaced the missing values with `'unknown'`.

Now we can proceed to testing the hypotheses.

## Testing hypotheses

### Comparison of the behavior of users of two capitals

The first hypothesis claims that users listen to music differently in Moscow and St. Petersburg.

We test this assumption using data for three days of the week: Monday, Wednesday, and Friday.

In [141]:
df.groupby('city')['genre'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

There are more plays in Moscow than in St. Petersburg.

It does not follow from this that Moscow users listen to music more often.
There are simply more users in Moscow.
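The claim about the number of users can be checked directly by counting unique user IDs per city and dividing plays by users. This is a minimal sketch on a hypothetical sample (the rows are invented); it assumes the renamed columns `user_id`, `city`, and `genre`.

```python
import pandas as pd

# Invented rows: three Moscow users, one St. Petersburg user.
sample = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'D4', 'D4'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'rock', 'pop', 'dance', 'jazz', 'rock'],
})

# Named aggregation: total plays and number of distinct users per city.
per_city = sample.groupby('city').agg(
    plays=('genre', 'count'),
    users=('user_id', 'nunique'),
)
# Plays per user puts cities of different size on a comparable scale.
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```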

In [145]:
df.groupby('day')['genre'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

Across both cities combined, users are less active on Wednesdays. But the picture can change if you consider each city separately.

In [149]:
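def number_tracks(day, city):
    """Count plays for the given day of the week and city."""
    # Keep only rows that match both the day and the city.
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # user_id has no missing values, so count() equals the number of rows.
    track_list_count = track_list['user_id'].count()
    return track_list_count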

In [150]:
number_tracks('Monday', 'Moscow') # the number of plays in Moscow on Mondays

15740

In [151]:
number_tracks('Monday', 'Saint-Petersburg') # the number of plays in St. Petersburg on Mondays

5614

In [152]:
number_tracks('Wednesday', 'Moscow') # the number of plays in Moscow on Wednesdays

11056

In [153]:
number_tracks('Wednesday', 'Saint-Petersburg') # the number of plays in St. Petersburg on Wednesdays

7003

In [155]:
number_tracks('Friday', 'Moscow') # the number of plays in Moscow on Fridays

15945

In [154]:
number_tracks('Friday', 'Saint-Petersburg') # the number of plays in St. Petersburg on Fridays

5895

In [160]:
columns = ['city', 'monday', 'wednesday', 'friday'] # table with results
activity_by_day_and_city = pd.DataFrame([['Moscow', 15740, 11056, 15945], ['Saint-Petersburg', 5614, 7003, 5895]], columns=columns)
activity_by_day_and_city

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
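The results table was assembled by typing in the six counts. As an alternative sketch, `pd.crosstab` can build the same city-by-day layout directly from the rows. The sample below is invented for illustration, and `day_order` is a hypothetical helper list used only to fix the column order.

```python
import pandas as pd

# Invented rows; only the two columns needed for the cross-tabulation.
sample = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Wednesday'],
})

# crosstab sorts columns alphabetically, so restore the weekday order
# and fill in days that never occur in the sample with zero.
day_order = ['Monday', 'Wednesday', 'Friday']
counts = pd.crosstab(sample['city'], sample['day']).reindex(
    columns=day_order, fill_value=0
)
print(counts)
```

This avoids copying numbers by hand, so the table stays in sync if the preprocessing steps change.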


**Conclusion**

The data show a difference in user behavior:

- In Moscow, plays peak on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, people listen to music more on Wednesday. Activity on Monday and Friday is lower than on Wednesday.

So, the data speak in favor of the first hypothesis.
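Because Moscow has more than twice as many plays in total, the weekly pattern is easier to compare as shares of each city's own total. The sketch below uses the six counts reported in the results table; the variable names are chosen for illustration.

```python
import pandas as pd

# Counts copied from the results table above.
activity = pd.DataFrame(
    {
        'monday': [15740, 5614],
        'wednesday': [11056, 7003],
        'friday': [15945, 5895],
    },
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide every row by its own total so each city sums to 1.
shares = activity.div(activity.sum(axis=1), axis=0)
print(shares.round(3))
```

On these counts, Wednesday carries the smallest share of plays in Moscow and the largest in St. Petersburg, which matches the conclusion drawn from the raw numbers.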

### Music at the beginning and at the end of the week

According to the second hypothesis, on Monday morning some genres prevail in Moscow and others in St. Petersburg. On Friday evening, different genres prevail as well, depending on the city.

In [203]:
moscow_general = df[df['city'] == 'Moscow']

In [204]:
spb_general = df[df['city'] == 'Saint-Petersburg']

In [232]:
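def genre_weekday(table, day, time1, time2):
    """Return the ten most played genres for a day within the (time1, time2) window."""
    # Times are zero-padded 'HH:MM:SS' strings, so string comparison follows clock order.
    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    # Count plays per genre and sort from most to least played.
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    return genre_df_sorted[:10]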
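The time filter in `genre_weekday` compares strings rather than parsed times. That works because the values are zero-padded, but the boundaries behave slightly differently from a numeric comparison. The sketch below contrasts the two on a few made-up timestamps, assuming the `time` column always has the `HH:MM:SS` form seen in `df.head(10)`.

```python
import pandas as pd

# Made-up timestamps around the 07:00 and 11:00 boundaries.
sample = pd.DataFrame({
    'time': ['06:59:59', '07:00:00', '08:34:34', '10:59:59', '11:00:00'],
})

# Lexicographic comparison, as in genre_weekday.
as_strings = sample[(sample['time'] > '07:00') & (sample['time'] < '11:00')]

# The same window after parsing the strings into timedeltas.
parsed = pd.to_timedelta(sample['time'])
as_durations = sample[
    (parsed > pd.Timedelta(hours=7)) & (parsed < pd.Timedelta(hours=11))
]

print(as_strings['time'].tolist())
print(as_durations['time'].tolist())
```

The string version keeps `07:00:00`, because a longer string that starts with `'07:00'` sorts after it, while the parsed version treats it as equal to the lower bound and drops it. For a top-10 ranking the difference is at most a handful of rows, but it is worth knowing which boundary rule is in effect.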

In [233]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00') # top-10 music genres played in Moscow from 07:00 to 11:00 on Monday

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [234]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00') # top-10 music genres played in St. Petersburg from 07:00 to 11:00 on Monday

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [235]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00') # top-10 music genres played in Moscow on Friday evening

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [236]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00') # top-10 music genres played in St. Petersburg on Friday evening

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusion**

Comparing the top 10 genres on Monday morning, we can draw the following conclusions:

1. People in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow ranking includes the "world" genre, while the St. Petersburg ranking includes jazz and classical music.

2. There were so many missing values in Moscow that `'unknown'` took tenth place among the most popular genres. Hence, missing values make up a significant share of the data and threaten the reliability of the study.

Friday evening does not change this picture. Some genres rise a little, others fall, but in general the top 10 remains the same.

Thus, the second hypothesis was confirmed only partially:

* Users listen to similar music at the beginning and at the end of the week.

* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, people listen to Russian popular music more often; in St. Petersburg, to jazz.

However, gaps in the data call this result into question. In Moscow, there are so many of them that the top-10 ranking could have looked different if the genre data had not been lost.

### Genre preferences in Moscow and St. Petersburg

Hypotheses:

* St. Petersburg is the capital of rap: music of this genre is listened to there more often than in Moscow.

* Moscow is a city of contrasts in which, nevertheless, pop music prevails.

In [209]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [210]:
moscow_genres[:10]

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2095
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [211]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [213]:
spb_genres[:10]

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

The hypothesis was partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 includes a related genre, Russian popular music.

* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg: `rusrap` ranks tenth in Moscow and eighth in St. Petersburg.
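Raw counts are hard to compare between cities of different size, so it can help to express each genre as a percentage of the city's total plays. The sketch below uses only numbers already reported above: the city totals from `df.groupby('city')['genre'].count()` and the `pop` and `rusrap` counts from the two top-10 lists.

```python
import pandas as pd

# Totals per city and genre counts, copied from the outputs above.
counts = pd.DataFrame(
    {
        'total': [42741, 18512],
        'pop': [5892, 2431],
        'rusrap': [1161, 564],
    },
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each genre count by the city's total plays to get a percentage.
share_percent = counts[['pop', 'rusrap']].div(counts['total'], axis=0) * 100
print(share_percent.round(2))
```

On these counts, `rusrap` accounts for roughly 2.7% of plays in Moscow and 3.0% in St. Petersburg, and `pop` for roughly 13.8% and 13.1%. The gaps are small in both directions, which is in line with the reading that the two cities have similar tastes.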


## Results

We tested three hypotheses and concluded:

1. The day of the week affects user activity, and it does so differently in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much over the week, in either Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:

* in Moscow, the "world" genre makes the top 10;
* in St. Petersburg, jazz and classical music do.

Thus, the second hypothesis was confirmed only in part. This result could have been different if not for the gaps in the data.

3. The tastes of users in Moscow and St. Petersburg have much in common. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If differences in preferences exist, they are negligible.