# Study of user preferences of Moscow and St. Petersburg residents in the Yandex.Music application

The aim of the study is to test three hypotheses:

1. User activity depends on the day of the week. Moreover, in Moscow and St. Petersburg this manifests itself in different ways. 
2. On Monday morning, certain genres prevail in Moscow, while others prevail in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.
3. Moscow and St. Petersburg prefer different genres of music. In Moscow, they listen to pop music more often, in St. Petersburg - Russian rap.

**Research progress**

The data on user behavior comes from the `yandex_music_project.csv` file. Nothing is known about the quality of the data. Therefore, before testing hypotheses, a review of the data is required.

Work plan: 
1. Data review. 
2. Data preprocessing. 
3. Hypothesis testing.

## Data review

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/yandex_music_project.csv')

In [3]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
  userID    65079 non-null object
Track       63848 non-null object
artist      57876 non-null object
genre       63881 non-null object
  City      65079 non-null object
time        65079 non-null object
Day         65079 non-null object
dtypes: object(7)
memory usage: 3.5+ MB


So the table has seven columns. The data type in all columns is `object`.

According to the data documentation:
* `userID` - user ID;
* `Track` - track name;
* `artist` - artist name;
* `genre` - genre name;
* `City` - user's city;
* `time` - start time of listening;
* `Day` - day of the week.

There are three style violations in the column headings:
1. Lowercase letters are mixed with uppercase.
2. Some names contain spaces.
3. Snake case (`snake_case`) is not used.



The number of values in the columns varies. This means there are missing values in the data.
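The number of missing values can be read straight off the `df.info()` output by subtracting each non-null count from the total number of rows. A minimal sketch of that arithmetic, with the counts copied by hand from the output above:

```python
# Counts copied from the df.info() output above.
total_rows = 65079
non_null = {'Track': 63848, 'artist': 57876, 'genre': 63881}

# Missing values per column = total rows minus non-null entries.
missing = {column: total_rows - count for column, count in non_null.items()}
print(missing)  # {'Track': 1231, 'artist': 7203, 'genre': 1198}
```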


**Conclusions**

Each row of the table contains data about a track that was played. Some columns describe the track itself: title, artist and genre. The rest describe the user: which city they are from and when they listened to the music.

Preliminarily, there appears to be enough data to test the hypotheses. However, the data contains missing values, and the column names do not follow good style.

To move forward, these problems in the data need to be fixed.

## Data preprocessing

### Heading style

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

In [6]:
df = df.rename(columns={'  userID':'user_id', 'Track':'track','  City  ':'city','Day':'day'})

In [7]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
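
The manual rename works well for seven columns. For wider tables, the same cleanup could be done programmatically. The sketch below is only an illustration: `to_snake_case` is a hypothetical helper, not part of pandas, and it is applied to an empty toy frame whose headers mimic the ones in this dataset.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Hypothetical helper: strip whitespace and convert camelCase to snake_case."""
    name = name.strip()
    # Insert an underscore at every lowercase/digit -> uppercase boundary.
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Toy frame whose headers mimic the raw column names of the dataset.
sample = pd.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)
sample.columns = [to_snake_case(col) for col in sample.columns]
print(list(sample.columns))
```

On these seven headers the helper produces the same names as the explicit `rename` call.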

### Missing values

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

In [9]:
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
print(df.isna().sum())

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64


### Duplicates

In [11]:
print(df.duplicated().sum())

3826


In [12]:
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
# check that no duplicates remain
print(df.duplicated().sum())

0


In [14]:
genre_list = df.sort_values(by='genre', ascending = True)
genre_list = genre_list['genre'].unique()
print(genre_list)

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 

In [15]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # Replace each implicit duplicate in the global df with the canonical genre name.
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:
wrong_names = ['hip', 'hop', 'hip-hop']
right_name = 'hiphop'
replace_wrong_genres(wrong_names, right_name)

In [17]:
genre_list = df.sort_values(by='genre', ascending = True)
genre_list = genre_list['genre'].unique()
print(genre_list)

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 
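
The loop over wrong genre names can also be written as a single call, because `Series.replace` accepts a list of values to replace. A minimal sketch on a toy series (the values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy genre column containing the implicit duplicates handled above.
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'], name='genre')

# Series.replace accepts a list of values to replace and a single replacement.
cleaned = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(cleaned.unique())
```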

**Conclusions**

Preprocessing found three problems in the data:

- column header style violations,
- missing values,
- duplicates - explicit and implicit.

You've fixed the headers to make the table easier to work with. Without duplicates, the study will become more accurate.

You have replaced missing values with `'unknown'`. It remains to be seen whether the gaps in the `genre` column will harm the study.
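
One way to gauge how much the gaps matter is to measure the share of `'unknown'` genres in each city. The sketch below uses made-up toy data in place of the real table; only the technique carries over.

```python
import pandas as pd

# Toy data standing in for the preprocessed table; the values are made up.
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'pop'],
})

# The mean of a boolean mask is the fraction of rows where it is True.
unknown_share = (plays['genre'] == 'unknown').groupby(plays['city']).mean()
print(unknown_share)
```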

Now we can move on to hypothesis testing. 

## Hypothesis testing

### Comparison of user behavior in Moscow and Saint-Petersburg

In [18]:
df.groupby('city')['genre'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

There are more plays in Moscow than in St. Petersburg. It does not follow from this that Moscow users listen to music more often. There are simply more users in Moscow.
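
The claim about the number of users can be checked by counting unique user IDs per city next to the number of plays. A minimal sketch on made-up toy data (the real check would use `df` with its `user_id` and `city` columns):

```python
import pandas as pd

# Toy data; user IDs and cities are made up for illustration.
plays = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'C3', 'D4', 'D4'],
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
})

# Plays and distinct users per city, then the average plays per user.
per_city = plays.groupby('city')['user_id'].agg(plays='count', users='nunique')
per_city['plays_per_user'] = per_city['plays'] / per_city['users']
print(per_city)
```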

In [19]:
df.groupby('day')['genre'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

Across both cities combined, users are less active on Wednesdays. But the picture may change if we consider each city separately.

In [20]:
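def number_tracks(day, city):
    """Return the number of plays for the given day and city."""
    # Keep only rows matching both the day and the city.
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    # user_id has no missing values, so count() equals the number of rows.
    return track_list['user_id'].count()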

In [21]:
number_tracks('Monday','Moscow')

15740

In [22]:
number_tracks('Monday','Saint-Petersburg')

5614

In [23]:
number_tracks('Wednesday','Moscow')

11056

In [24]:
number_tracks('Wednesday','Saint-Petersburg')

7003

In [25]:
number_tracks('Friday','Moscow')

15945

In [26]:
number_tracks('Friday','Saint-Petersburg')

5895

In [27]:
data = [['Moscow', 15740, 11056, 15945],
        ['Saint-Petersburg', 5614, 7003, 5895]] 
columns = ['city', 'monday', 'wednesday', 'friday'] 
table = pd.DataFrame(data = data, columns = columns) 
display(table)

| | city | monday | wednesday | friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
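
The table above was assembled by hand from six separate calls. A city-by-day table can also be built in one step with `pd.crosstab`. The sketch below uses a small made-up frame, so its numbers do not match the real data.

```python
import pandas as pd

# Toy data; in the notebook the same call would take df['city'] and df['day'].
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Rows are cities, columns are days, cells are play counts (0 where no plays).
counts = pd.crosstab(plays['city'], plays['day'])
print(counts)
```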


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, listening peaks on Monday and Friday, with a noticeable decline on Wednesday.
- In St. Petersburg, on the contrary, users listen to music most on Wednesday. Activity on Monday and Friday is lower than on Wednesday by a similar margin.

So the data support the first hypothesis.
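
Because the cities differ in size, the pattern is easier to compare as shares of each city's total. A short sketch using the counts reported above:

```python
import pandas as pd

# Play counts copied from the table above.
table = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its total so cities of different size become comparable.
shares = table.div(table.sum(axis=1), axis=0).round(3)
print(shares)
```

Wednesday accounts for about 26% of plays in Moscow and about 38% in St. Petersburg, which agrees with the conclusion above.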

### Music at the beginning and end of the week

According to the second hypothesis, on Monday morning certain genres predominate in Moscow, while others dominate in St. Petersburg. Similarly, Friday evenings are dominated by different genres, depending on the city.

In [28]:
moscow_general = df[df['city'] == 'Moscow']
moscow_general.head()

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | unknown | ruspop | Moscow | 09:17:40 | Friday |


In [29]:
spb_general = df[df['city'] == 'Saint-Petersburg']
spb_general.head()

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 9 | E772D5C0 | Pessimist | unknown | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [30]:
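def genre_weekday(table, day, time1, time2):
    """Return the ten most played genres for a day within (time1, time2)."""
    genre_df = table[table['day'] == day]
    # Times are zero-padded 'HH:MM:SS' strings, so string comparison follows clock order.
    genre_df = genre_df[(genre_df['time'] > time1) & (genre_df['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['genre'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)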
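
The function compares times as plain strings. This works because the `time` values are zero-padded `HH:MM:SS` strings, so lexicographic order coincides with clock order. The sketch below checks this on a few made-up timestamps against an explicit `pd.to_timedelta` conversion. It is an illustration only, not part of the original analysis.

```python
import pandas as pd

# Made-up timestamps in the same 'HH:MM:SS' format as the time column.
times = pd.Series(['06:59:59', '07:00:01', '08:34:34', '10:59:59', '11:00:00'])

# Lexicographic comparison of zero-padded strings, as in genre_weekday.
by_string = (times > '07:00') & (times < '11:00')

# The same window expressed with real durations.
as_delta = pd.to_timedelta(times)
by_delta = (as_delta > pd.Timedelta(hours=7)) & (as_delta < pd.Timedelta(hours=11))

print(by_string.tolist())
print(by_string.equals(by_delta))
```

One edge case differs: the string `'07:00:00'` compares as greater than `'07:00'`, so the string version includes a play at exactly 07:00:00, while a strict duration comparison would exclude it.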

In [31]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [32]:
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [33]:
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [34]:
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
rock           216
electronic     216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

If we compare the top 10 genres on Monday morning, we can draw the following conclusions:

1. In Moscow and St. Petersburg they listen to similar music. The only difference is that the Moscow ranking includes the “world” genre, while the St. Petersburg ranking includes jazz and classical.

2. There were so many missing values in Moscow that the value `'unknown'` took tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the study.

Friday evening does not change this picture much. Some genres rise a little, others drop, but overall the top 10 stays largely the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning of the week and at the end.
* The difference between Moscow and St. Petersburg is not very pronounced. In Moscow, they listen to Russian popular music more often, in St. Petersburg - jazz.

However, gaps in the data cast doubt on this result. There are so many of them in Moscow that the top 10 ranking could look different if it were not for the lost genre data.

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap, and music of this genre is listened to there more often than in Moscow. Moscow is a city of contrasts, which is nevertheless dominated by pop music.

In [35]:
moscow_genres = moscow_general.groupby('genre').count().sort_values(by='track', ascending = False)

In [36]:
moscow_genres.head(10)

| genre | user_id | track | artist | city | time | day |
|---|---|---|---|---|---|---|
| pop | 5892 | 5892 | 5892 | 5892 | 5892 | 5892 |
| dance | 4435 | 4435 | 4435 | 4435 | 4435 | 4435 |
| rock | 3965 | 3965 | 3965 | 3965 | 3965 | 3965 |
| electronic | 3786 | 3786 | 3786 | 3786 | 3786 | 3786 |
| hiphop | 2096 | 2096 | 2096 | 2096 | 2096 | 2096 |
| classical | 1616 | 1616 | 1616 | 1616 | 1616 | 1616 |
| world | 1432 | 1432 | 1432 | 1432 | 1432 | 1432 |
| alternative | 1379 | 1379 | 1379 | 1379 | 1379 | 1379 |
| ruspop | 1372 | 1372 | 1372 | 1372 | 1372 | 1372 |
| rusrap | 1161 | 1161 | 1161 | 1161 | 1161 | 1161 |


In [37]:
spb_genres = spb_general.groupby('genre').count().sort_values(by='track', ascending = False)

In [38]:
spb_genres.head(10)

| genre | user_id | track | artist | city | time | day |
|---|---|---|---|---|---|---|
| pop | 2431 | 2431 | 2431 | 2431 | 2431 | 2431 |
| dance | 1932 | 1932 | 1932 | 1932 | 1932 | 1932 |
| rock | 1879 | 1879 | 1879 | 1879 | 1879 | 1879 |
| electronic | 1736 | 1736 | 1736 | 1736 | 1736 | 1736 |
| hiphop | 960 | 960 | 960 | 960 | 960 | 960 |
| alternative | 649 | 649 | 649 | 649 | 649 | 649 |
| classical | 646 | 646 | 646 | 646 | 646 | 646 |
| rusrap | 564 | 564 | 564 | 564 | 564 | 564 |
| ruspop | 538 | 538 | 538 | 538 | 538 | 538 |
| world | 515 | 515 | 515 | 515 | 515 | 515 |
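
Since `groupby('genre').count()` repeats the same number in every column, a single-column ranking may be easier to read. One possible alternative is `Series.value_counts`, sketched here on made-up toy data:

```python
import pandas as pd

# Toy data; in the notebook the call would be applied to moscow_general['genre'].
plays = pd.DataFrame({'genre': ['pop', 'rock', 'pop', 'dance', 'pop', 'rock']})

# value_counts returns counts sorted in descending order by default.
top = plays['genre'].value_counts().head(2)
print(top)
```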


**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, the top 10 also includes a closely related genre, Russian pop.
* Contrary to expectations, rap is about equally popular in Moscow and St. Petersburg.
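
Raw counts are hard to compare between cities of different size, so it helps to convert them to percentages of each city's total plays. A short sketch using the counts reported in the tables above and the per-city totals from the start of the hypothesis testing section:

```python
# Total plays per city and genre counts, copied from the outputs above.
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}
genre_counts = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'hiphop': {'Moscow': 2096, 'Saint-Petersburg': 960},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# Percentage of each city's plays that belongs to the genre.
genre_shares = {
    genre: {city: round(100 * n / totals[city], 1) for city, n in counts.items()}
    for genre, counts in genre_counts.items()
}
for genre, shares in genre_shares.items():
    print(genre, shares)
```

Russian rap makes up about 2.7% of plays in Moscow and about 3.0% in St. Petersburg, and pop about 13.8% and 13.1%, so the shares are close in both cities.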

## Research results

Three hypotheses were tested:

1. The day of the week has a different effect on the activity of users in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much during the week - be it Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow they listen to music of the “world” genre,
* in St. Petersburg - jazz and classical music.

Thus, the second hypothesis was only partly confirmed. This result could have been different were it not for gaps in the data.

3. The tastes of users of Moscow and St. Petersburg have more in common than differences. Contrary to expectations, genre preferences in St. Petersburg resemble those in Moscow.

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable among the bulk of users.