# Yandex Music

Using a Yandex Music dataset, we will compare the behavior of users in two capital cities, Moscow and St. Petersburg.

**The purpose of the research is to test three hypotheses:**

1. User activity depends on the day of the week, and the pattern differs between Moscow and St. Petersburg.
2. On Monday morning, some genres prevail in Moscow and others in St. Petersburg. Likewise, on Friday evening the prevailing genres depend on the city.
3. Moscow and St. Petersburg prefer different genres of music: pop is listened to more often in Moscow, Russian rap in St. Petersburg.

**The course of the investigation**

The data on user behavior comes from the file `yandex_music_project.csv`. Nothing is known about its quality, so the data must be reviewed before the hypotheses are tested.

We will check the data for errors and assess their impact on the research. Then, at the preprocessing stage, we will look for ways to fix the most critical errors.
 
Thus, the research will be conducted in three stages:
1. Data review.
2. Data preprocessing.
3. Hypothesis testing.



## Data review




In [1]:
import pandas as pd

In [2]:
df = pd.read_csv('/datasets/yandex_music_project.csv')

Let's display the first ten rows of the table on the screen:

In [3]:
display(df.head(10))

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


Let's display the full information of the table on the screen:

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


Thus, there are seven columns in the table. The data type in all columns is `object`.

According to the data documentation:
* `userID`: user identifier;
* `Track`: track title;
* `artist`: artist name;
* `genre`: genre name;
* `City`: the user's city;
* `time`: the time the listening started;
* `Day`: day of the week.

The column names contain three style violations:
1. Lowercase letters are mixed with uppercase letters.
2. Some names contain leading or trailing spaces.
3. `userID` joins two words without a separator instead of using snake_case.

The number of non-null values differs between columns, which means the data contains missing values.


**Conclusions:**

Each row of the table describes one listened track. Some columns describe the track itself: title, artist, and genre. The rest describe the user: the city they listened from and when.

At first glance, there is enough data to test the hypotheses. However, the data contains missing values, and the column names do not follow good style.

To move forward, these problems in the data need to be fixed.

## Data preprocessing

Let's fix the style of the column headers and handle the missing values. Then let's check the data for duplicates.

### Column header style

Let's display the column names on the screen:

In [7]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Let's bring the names in line with good style:
* write multi-word names in snake_case,
* make all characters lowercase,
* remove spaces.

Let's rename the columns as follows:
* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [8]:
df = df.rename(columns = {'  userID':'user_id','Track':'track','  City  ':'city','Day':'day'})

Let's check the result. To do this, let's display the column names again:

In [9]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
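The same renaming can also be done programmatically instead of listing every column by hand. The sketch below is one possible approach, not part of the original project: the helper name `to_snake_case` and the empty toy frame `raw` are hypothetical, and the regular expression only handles the simple camelCase pattern seen in `userID`.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Strip surrounding spaces, split camelCase, and lowercase the name."""
    name = name.strip()
    # Insert "_" between a lowercase letter or digit and an uppercase letter,
    # e.g. "userID" -> "user_ID"
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# Hypothetical empty frame that reuses the raw headers shown above
raw = pd.DataFrame(
    columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
)
clean = raw.rename(columns=to_snake_case)
print(list(clean.columns))
# ['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day']
```

Passing a function to `rename(columns=...)` applies it to every header, so the approach scales to wider tables, at the cost of being less explicit than a hand-written mapping.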

### Missing values

Let's count how many missing values there are in the table:

In [10]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values affect the research. The gaps in `track` and `artist` do not matter for this work, so it is enough to replace them with an explicit placeholder.

Gaps in `genre`, however, can hinder the comparison of musical tastes in Moscow and St. Petersburg. In practice, the right approach would be to find out why the values are missing and restore the data. That is not possible in this educational project, so we will have to:
* fill these gaps with an explicit placeholder as well,
* assess how much they distort the calculations.

Let's replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`. We will create a list `columns_to_replace`, loop through its elements with a `for` loop, and for each column, we will replace the missing values:

In [11]:
columns_to_replace = ['track','artist','genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Let's make sure there are no gaps left in the table. To do this, let's count the missing values again.

In [12]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
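As an alternative to the `for` loop, pandas can fill several columns in one assignment. This is a minimal sketch on a hypothetical toy frame (`toy` and its values are made up for illustration); the column list is the same `columns_to_replace` as above.

```python
import pandas as pd

# Hypothetical toy data with gaps in the three columns of interest
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3'],
    'track': ['Song', None, 'Tune'],
    'artist': [None, None, 'Band'],
    'genre': ['rock', 'pop', None],
})

columns_to_replace = ['track', 'artist', 'genre']
# Select the columns as a sub-frame, fill the gaps, and assign the result back
toy[columns_to_replace] = toy[columns_to_replace].fillna('unknown')

print(toy.isna().sum().sum())  # total number of missing values left: 0
```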

### Duplicates

Let's count the explicit duplicates in the table with a single command:

In [13]:
df.duplicated().sum()

3826

Let's remove the explicit duplicates:

In [14]:
df = df.drop_duplicates().reset_index(drop=True)

Let's count the explicit duplicates again to make sure they are all gone:

In [15]:
df.duplicated().sum()

0
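To make clear what counts as an explicit duplicate, here is a small sketch on hypothetical toy data (the frame `toy` is invented for illustration). A row is flagged only when it repeats an earlier row in every column; a row that differs in a single field, such as `time`, is kept.

```python
import pandas as pd

# Hypothetical toy data: row 1 repeats row 0 exactly, row 3 differs in `time`
toy = pd.DataFrame({
    'user_id': ['A1', 'A1', 'B2', 'A1'],
    'track': ['Song', 'Song', 'Song', 'Song'],
    'time': ['08:00:00', '08:00:00', '08:00:00', '09:30:00'],
})

# duplicated() marks rows equal to an earlier row in all columns
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence; reset_index() renumbers the rows
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))  # 1 3
```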

Now let's get rid of the implicit duplicates in the `genre` column. For example, the name of the same genre can be written in slightly different ways. Such errors also affect the research result.

Let's display a list of unique genre names sorted in alphabetical order.

In [16]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

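Let's eliminate the implicit duplicates with a function that replaces each wrong name with the correct one.

In [17]:
def replace_wrong_genres(wrong_genres, correct_genre):
    # Replace every variant spelling in `wrong_genres` with `correct_genre`
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)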

In [18]:
wrong_genres = ['hip', 'hop', 'hip-hop']
correct_genre = 'hiphop'
replace_wrong_genres(wrong_genres, correct_genre)

Let's check that the incorrect names have been replaced.

In [19]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
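The loop inside `replace_wrong_genres()` is not strictly required: `Series.replace` also accepts a list of values to replace. The sketch below shows this on a hypothetical toy series (`genres` is invented for illustration). Without `regex=True`, the replacement matches whole values only, so an already correct `'hiphop'` is left untouched.

```python
import pandas as pd

# Hypothetical toy series with three variant spellings of the same genre
genres = pd.Series(['hip', 'rock', 'hop', 'hip-hop', 'hiphop', 'pop'])

# One call replaces every listed variant with the single correct name
unified = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')

print(unified.tolist())
# ['hiphop', 'rock', 'hiphop', 'hiphop', 'hiphop', 'pop']
```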

**Conclusions**

Preprocessing revealed three problems in the data:

- violations in the style of headers,
- missing values,
- duplicates - both explicit and implicit.

The identified issues in the data have been resolved and now we can proceed to hypothesis testing. 

## Hypothesis testing

### Compare the behavior of users in two capitals

The first hypothesis states that users listen to music differently in Moscow and St. Petersburg. Let's check this assumption based on data from three days of the week - Monday, Wednesday, and Friday.

Let's assess the activity of users in each city. We will group the data by city and count the listens in each group.

In [20]:
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

In Moscow, there are more listens than in St. Petersburg. This does not mean that Moscow users listen to music more often. It is simply that there are more users in Moscow.

Now let's group the data by day of the week and count the listens on Mondays, Wednesdays, and Fridays. Note that the data only contains information about listens on these days.

In [21]:
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

On average, users in both cities are less active on Wednesdays. However, the picture may change if each city is considered separately.

Let's write a function that will combine these two calculations.

In [22]:
def number_tracks(day,city):
    track_list = df[df['day']==day]
    track_list = track_list[track_list['city']==city]
    track_list_count = track_list['user_id'].count()
    return track_list_count
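The two sequential filters in `number_tracks()` can also be written as one boolean mask. The variant below is a sketch, not the project's function: the name `number_tracks_masked`, the extra `frame` parameter, and the toy data are assumptions made so that the example runs on its own.

```python
import pandas as pd


def number_tracks_masked(frame, day, city):
    """Count listens for one day and one city using a single combined mask."""
    # Each comparison yields a boolean Series; `&` keeps rows where both hold
    mask = (frame['day'] == day) & (frame['city'] == city)
    # True counts as 1, so the sum is the number of matching rows
    return int(mask.sum())


# Hypothetical toy data
toy = pd.DataFrame({
    'user_id': ['A1', 'B2', 'C3', 'D4'],
    'city': ['Moscow', 'Moscow', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Friday', 'Monday', 'Monday'],
})

print(number_tracks_masked(toy, 'Monday', 'Moscow'))  # 2
```

Passing the dataframe as an argument, rather than reading the global `df`, makes the function easier to reuse and test.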

Let's look at the data for each city for each of the three days.

In [23]:
number_tracks('Monday','Moscow')

15740

In [24]:
number_tracks('Monday','Saint-Petersburg')

5614

In [25]:
number_tracks('Wednesday','Moscow')

11056

In [26]:
number_tracks('Wednesday','Saint-Petersburg')

7003

In [27]:
number_tracks('Friday','Moscow')

15945

In [28]:
number_tracks('Friday','Saint-Petersburg')

5895

Let's create a summary table.

In [29]:
data = [['Moscow',15740,11056,15945],['Saint-Petersburg',5614,7003,5895]]
columns = ['city', 'monday', 'wednesday', 'friday']
table = pd.DataFrame(data = data, columns = columns)
display(table)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
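The summary table above was typed in by hand from six separate calls. A cross-tabulation can build the same kind of table directly from the data. The sketch below uses hypothetical toy data (`toy` is invented for illustration) and `pd.crosstab`, which counts rows for every combination of city and day.

```python
import pandas as pd

# Hypothetical toy data: one row per listen
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Monday', 'Wednesday', 'Friday',
            'Monday', 'Wednesday', 'Wednesday'],
})

# Rows are cities, columns are days, cells are listen counts
summary = pd.crosstab(toy['city'], toy['day'])

# Columns come out in alphabetical order, so put them in calendar order
summary = summary.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)

print(summary)
```

Because Moscow has many more listens overall, it can also help to compare shares rather than counts; `pd.crosstab(..., normalize='index')` returns each day's share within its city.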


**Conclusions**

The data shows the difference in user behavior:

- In Moscow, peak listening activity is on Mondays and Fridays, while Wednesdays show a decrease.
- In St. Petersburg, on the other hand, music is more actively listened to on Wednesdays. Activity on Mondays and Fridays here is almost equally lower than on Wednesdays.

The data suggests that the first hypothesis is correct.

### Music at the beginning and end of the week

According to the second hypothesis, different genres predominate in the morning on Monday in Moscow and in St. Petersburg. Similarly, different genres predominate in the evening on Friday, depending on the city.

Let's save the data tables in two variables:
* in Moscow — `moscow_general`;
* in St. Petersburg — `spb_general`.

In [30]:
moscow_general = df[df['city']=='Moscow'] 
display(moscow_general)


Unnamed: 0,user_id,track,artist,genre,city,time,day
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,unknown,ruspop,Moscow,09:17:40,Friday
...,...,...,...,...,...,...,...
61247,83A474E7,I Worship Only What You Bleed,The Black Dahlia Murder,extrememetal,Moscow,21:07:12,Monday
61248,729CBB09,My Name,McLean,rnb,Moscow,13:32:28,Wednesday
61250,C5E3A0D5,Jalopiina,unknown,industrial,Moscow,20:09:26,Friday
61251,321D0506,Freight Train,Chas McDevitt,rock,Moscow,21:43:59,Friday


In [31]:
spb_general = df[df['city']=='Saint-Petersburg']
display(spb_general)

Unnamed: 0,user_id,track,artist,genre,city,time,day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
9,E772D5C0,Pessimist,unknown,dance,Saint-Petersburg,21:20:49,Wednesday
...,...,...,...,...,...,...,...
61239,D94F810B,Theme from the Walking Dead,Proyecto Halloween,film,Saint-Petersburg,21:14:40,Monday
61240,BC8EC5CF,Red Lips: Gta (Rover Rework),Rover,electronic,Saint-Petersburg,21:06:50,Monday
61241,29E04611,Bre Petrunko,Perunika Trio,world,Saint-Petersburg,13:56:00,Monday
61242,1B91C621,(Hello) Cloud Mountain,sleepmakeswaves,postrock,Saint-Petersburg,09:22:13,Monday


Let's create a function `genre_weekday()` with four parameters:
* the dataframe,
* the day of the week,
* the start timestamp in the format 'hh:mm',
* the end timestamp in the format 'hh:mm'.

The function should return the top 10 genres of the tracks that were listened to on the specified day between the two timestamps.

In [32]:
def genre_weekday(table, day, time1, time2):
    genre_df = table[table['day']==day]
    genre_df = genre_df[genre_df['time']>time1]
    genre_df = genre_df[genre_df['time']<time2]
    genre_df_count = genre_df.groupby('genre')['city'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)
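The time filter in `genre_weekday()` compares strings, not time objects. This works because zero-padded `'hh:mm:ss'` strings sort lexicographically in chronological order. The sketch below illustrates the boundary behavior on hypothetical toy data; the function name `genre_weekday_masked` and the frame `toy` are assumptions, and the logic is meant to mirror the original filters in a single mask.

```python
import pandas as pd


def genre_weekday_masked(table, day, time1, time2):
    """Top 10 genres for one day within the window (time1, time2)."""
    in_day = table['day'] == day
    # String comparison: '07:00:00' > '07:00' is True (a longer string with
    # the same prefix is greater), while '11:00:00' < '11:00' is False.
    # So a listen at exactly time1 is included and one at exactly time2 is not.
    in_window = (table['time'] > time1) & (table['time'] < time2)
    return table[in_day & in_window]['genre'].value_counts().head(10)


# Hypothetical toy data around the window boundaries
toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['06:59:59', '07:00:00', '08:15:00', '10:59:59', '11:00:00', '08:00:00'],
    'genre': ['rock', 'pop', 'pop', 'rock', 'jazz', 'pop'],
})

print(genre_weekday_masked(toy, 'Monday', '07:00', '11:00'))
```

Here `value_counts()` replaces the `groupby(...).count()` plus `sort_values()` pair; it counts the rows per genre and sorts them in descending order in one step.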

Let's compare the results of the `genre_weekday()` function for Moscow and St. Petersburg on Monday morning (from 7:00 to 11:00) and on Friday evening (from 17:00 to 23:00):

In [33]:
genre_weekday(moscow_general,'Monday','07:00','11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: city, dtype: int64

In [34]:
genre_weekday(spb_general,'Monday','07:00','11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: city, dtype: int64

In [35]:
genre_weekday(moscow_general,'Friday','17:00','23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: city, dtype: int64

In [36]:
genre_weekday(spb_general,'Friday','17:00','23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: city, dtype: int64

**Conclusions**

Comparing the top 10 genres on Monday morning, we can draw the following conclusions:

1. People in Moscow and St. Petersburg listen to similar music. The only difference is that the Moscow ranking includes the "world" genre, while the St. Petersburg ranking includes jazz and classical.

2. In Moscow, there are so many missing values that `'unknown'` takes tenth place among the most popular genres. This means that missing values make up a significant share of the data and threaten the reliability of the research.

Friday evening does not change this picture. Some genres rise a little, others drop, but overall the top 10 remains the same.

Thus, the second hypothesis was only partially confirmed:
* Users listen to similar music at the beginning and end of the week.
* The difference between Moscow and St. Petersburg is not very pronounced. Russian pop is listened to more often in Moscow, jazz in St. Petersburg.

However, the gaps in the data cast doubt on this result. In Moscow, there are so many of them that the top 10 ranking could have looked different if the genre data had not been lost.
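One way to quantify how much the gaps matter is to compute the share of `'unknown'` genres in each city. The sketch below shows the idea on hypothetical toy data (`toy` and its values are invented, so the resulting shares say nothing about the real dataset).

```python
import pandas as pd

# Hypothetical toy data: 1 of 4 Moscow rows and 1 of 2 St. Petersburg rows
# have an unknown genre
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'pop', 'unknown'],
})

# The comparison gives True/False per row; the mean of booleans within each
# city is the fraction of rows where the genre is unknown
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()

print(unknown_share)
```

Applied to a time window, the same calculation would show whether the placeholder's place in the ranking reflects a large share of lost data or just a long tail of small genres.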

### Genre preferences in Moscow and St. Petersburg

Hypothesis: St. Petersburg is the capital of rap music, and it is listened to more often there than in Moscow. Moscow, on the other hand, is a city of contrasts, where pop music predominates nonetheless.

Let's group the `moscow_general` table by genre and count the track plays of each genre using the `count()` method. Then we will sort the result in descending order and save it in the `moscow_genres` table.

In [37]:
moscow_general_count = moscow_general.groupby('genre')['user_id'].count()
moscow_genres = moscow_general_count.sort_values(ascending=False)

In [38]:
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: user_id, dtype: int64

In [39]:
spb_general_count = spb_general.groupby('genre')['user_id'].count()
spb_genres = spb_general_count.sort_values(ascending=False)

In [40]:
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: user_id, dtype: int64

**Conclusions**

The hypothesis was partially confirmed:
* Pop music is the most popular genre in Moscow, as the hypothesis predicted. Moreover, a closely related genre, Russian popular music, is in the top 10 genres.
* Contrary to expectations, rap is equally popular in Moscow and St. Petersburg. 
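Raw counts are hard to compare because Moscow has more than twice as many listens as St. Petersburg. Dividing each genre count by the city total gives comparable shares. The sketch below only reuses numbers printed earlier in this notebook (city totals from the `groupby('city')` output and the `pop` and `rusrap` rows of the two top 10 lists); the variable names are hypothetical.

```python
# Total listens per city, taken from the groupby('city') output above
moscow_total = 42741
spb_total = 18512

# (Moscow count, St. Petersburg count) from the two top 10 lists above
counts = {
    'pop': (5892, 2431),
    'rusrap': (1161, 564),
}

# Share of each genre within its own city
shares = {
    genre: (moscow / moscow_total, spb / spb_total)
    for genre, (moscow, spb) in counts.items()
}

for genre, (moscow_share, spb_share) in shares.items():
    print(f'{genre}: Moscow {moscow_share:.1%}, St. Petersburg {spb_share:.1%}')
# pop: Moscow 13.8%, St. Petersburg 13.1%
# rusrap: Moscow 2.7%, St. Petersburg 3.0%
```

The shares differ by well under one percentage point in both genres, which is consistent with the conclusion that the two cities have similar tastes.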


## The results of the research

We tested three hypotheses and found:

1. The day of the week affects user activity differently in Moscow and St. Petersburg.

The first hypothesis was fully confirmed.

2. Musical preferences do not change much over the week, in either Moscow or St. Petersburg. Small differences are noticeable at the beginning of the week, on Mondays:
* in Moscow, people listen to music of the "world" genre,
* in St. Petersburg, people listen to jazz and classical music.

Thus, the second hypothesis was only partially confirmed. This result could have been different if there were no gaps in the data.

3. The tastes of Moscow and St. Petersburg users have more in common than differences. Contrary to expectations, the genre preferences in St. Petersburg resemble those in Moscow. 

The third hypothesis was not confirmed. If there are differences in preferences, they are not noticeable in the majority of users.