# Yandex.Music

**Research Objective** — The goal is to test three hypotheses:

1. User activity varies by day of the week, with notable differences between Moscow and Saint Petersburg.
2. On Monday mornings, specific genres dominate in Moscow, while different genres are more prevalent in Saint Petersburg. Similarly, on Friday evenings, genre preferences differ depending on the city.
3. Moscow and Saint Petersburg favor different music genres. Pop music is more commonly listened to in Moscow, whereas rap music enjoys greater popularity in Saint Petersburg.

**Research Process**

The data on user behavior is stored in the file `yandex_music_project.csv`. The quality of the data is unknown, so a preliminary review is required before testing the hypotheses.

The data must be examined for errors, and the potential impact of these errors on the research needs to be assessed. During the preprocessing stage, the most critical data issues will be addressed.

The research will be conducted in three stages:

1. Data review
2. Data preprocessing
3. Hypothesis testing

## Data Review

In [1]:
import pandas as pd 

In [2]:
# Reading the data file and saving it to df
df = pd.read_csv('/datasets/yandex_music_project.csv')

In [3]:
# First 10 rows of the df table
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


In [4]:
# General information about the data
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


The table consists of 7 columns, all of which have the data type `object`.

According to the data documentation:

* `userID` — user identifier
* `Track` — track title
* `artist` — artist name
* `genre` — genre name
* `City` — user city
* `time` — time when listening started
* `Day` — day of the week

There are three style violations in the column names:

1. A mix of lowercase and uppercase letters.
2. The presence of spaces.
3. Lack of separation between words using underscores (snake_case).

Additionally, the number of non-null values differs between columns, which indicates the presence of missing data.

**Intermediate Conclusions**

Each row in the table represents data about a listened track. Some columns provide details about the track itself, such as the title, artist, and genre, while others relate to the user, including their city and the time they started listening.

Preliminarily, it can be concluded that the data is sufficient to test the hypotheses. However, there are missing values, and the column names do not adhere to best style practices.

To proceed, it is necessary to resolve these data issues.

## Data Preprocessing

### Header Style

In [5]:
# List of column names in the df table
print(df.columns) 

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')


In [6]:
# Renaming the columns
df = df.rename(
    columns = {
        '  userID': 'user_id',
        'Track': 'track',
        '  City  ': 'city',
        'Day': 'day'
    }
)

In [7]:
# Checking the results - list of column names
print(df.columns)

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
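
The manual mapping above works well for seven columns. As a hedged alternative, the three style fixes (trimming spaces, lowercasing, inserting underscores) could be applied by a small helper. `to_snake_case` is a hypothetical name introduced only for this sketch, and the empty `demo` frame stands in for the real data.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Hypothetical helper: trim spaces and convert camelCase to snake_case."""
    name = name.strip()
    # Put an underscore between a lowercase letter or digit and the uppercase letter after it
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# Column names exactly as printed by df.columns before renaming
raw_columns = ['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day']
demo = pd.DataFrame(columns=raw_columns)

# rename() accepts a function and applies it to every column name
demo = demo.rename(columns=to_snake_case)
print(list(demo.columns))
```

On these seven names the helper yields the same result as the explicit dictionary. The explicit dictionary remains easier to audit when names need individual decisions.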


### Missing Values

In [8]:
# Counting missing values
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

Not all missing values are critical to the research. For instance, missing data in `track` and `artist` does not significantly affect the analysis, and can simply be replaced with explicit placeholders.

However, missing values in the `genre` column could impact the comparison of musical preferences between Moscow and Saint Petersburg. Ideally, we would investigate the cause of these missing values and restore the data. Unfortunately, this is not feasible within the scope of this project. Therefore, we will:

* Replace the missing values with explicit placeholders,
* Evaluate the extent to which these replacements may influence the analysis.

In [9]:
# Iterating through column names and replacing missing values with 'unknown'
columns_to_replace = ['track','artist','genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

In [10]:
# Counting missing values in each column
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64
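
To evaluate how much the `'unknown'` placeholder may influence the analysis, one option is to measure its share overall and within each city. The sketch below is an assumption about how such a check might look, not part of the original notebook. It uses a tiny made-up frame (`toy`) for the per-city part and the counts reported above (1198 missing genres out of 65079 rows) for the overall share.

```python
import pandas as pd

# Overall share of missing genres, using the counts reported by df.info() and df.isna().sum()
overall_share = 1198 / 65079
print(f'{overall_share:.1%}')

# Made-up illustrative data, not taken from the real dataset
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown', 'pop', 'rock', 'unknown'],
})

# The mean of a boolean Series is the fraction of True values,
# so this gives the share of placeholder rows within each city
unknown_share = toy['genre'].eq('unknown').groupby(toy['city']).mean()
print(unknown_share)
```

If the share differs strongly between cities, genre rankings for the city with more placeholders deserve extra caution.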

### Duplicates

In [11]:
# Counting explicit duplicates
df.duplicated().sum()

3826

In [12]:
# Removing explicit duplicates (dropping the old index and creating a new one)
df = df.drop_duplicates().reset_index(drop=True)

In [13]:
# Checking for the absence of duplicates
df.duplicated().sum()

0
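
Whether the old index survives as an extra column depends on the `drop` argument of `reset_index()`. The following minimal sketch, on a made-up three-row frame, shows the difference.

```python
import pandas as pd

# Made-up data: rows 0 and 1 are exact duplicates
toy = pd.DataFrame({'user_id': ['A', 'A', 'B'], 'genre': ['pop', 'pop', 'rock']})

# Default behaviour: the old index is kept as a new column named 'index'
kept_index = toy.drop_duplicates().reset_index()

# drop=True discards the old index and only renumbers the rows
clean = toy.drop_duplicates().reset_index(drop=True)

print(list(kept_index.columns))
print(list(clean.columns))
print(list(clean.index))
```

An extra `index` column holds a unique value per row, so a later `duplicated()` check on the whole frame would report zero even if content duplicates were reintroduced.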

In [14]:
# Viewing unique genre names
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

In [15]:
# Function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)

In [16]:
# Removing implicit duplicates
duplicates = ['hip', 'hop', 'hip-hop']
genre = 'hiphop'
replace_wrong_genres(duplicates, genre)
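
The loop in the helper is not strictly required, because `Series.replace` also accepts a list of values to replace. A minimal sketch on a made-up Series (`genres` is illustrative, not the real column):

```python
import pandas as pd

# Made-up sample containing the three variant spellings plus two valid genres
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'pop'])

# Whole values are matched, not substrings, so 'hiphop' and 'pop' stay untouched
canonical = genres.replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(canonical.unique())
```

Because matching is on whole values, `'pop'` is not affected even though it shares letters with `'hop'`.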

In [17]:
# Checking for implicit duplicates
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Intermediate Conclusions**

The data preprocessing identified three main issues:

* Inconsistent header styling,
* Missing values,
* Duplicates — both explicit and implicit.

We addressed the header styling to improve readability and facilitate table management. Removing duplicates will enhance the accuracy of the analysis.

Missing values were replaced with `'unknown'`. However, it remains to be determined whether the missing data in the `genre` column will impact the results of the study.

With these issues addressed, we are now ready to proceed with hypothesis testing.

## Hypothesis Testing

### Comparison of User Behavior in the Two Capitals

In [18]:
# Counting listens in each city
df.groupby('city')['time'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: time, dtype: int64

In Moscow, there are more listens than in Saint Petersburg. However, this does not imply that Moscow users listen to music more frequently; it simply reflects that there are more users in Moscow.
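
One way to separate "more users" from "more listens per user" would be to divide the number of listens by the number of distinct users in each city. The sketch below illustrates the idea on made-up data. It was not run on the real dataset, so it says nothing about the actual ratio.

```python
import pandas as pd

# Made-up data: Moscow has more users, Saint Petersburg has one very active user
toy = pd.DataFrame({
    'city': ['Moscow'] * 4 + ['Saint-Petersburg'] * 3,
    'user_id': ['u1', 'u1', 'u2', 'u3', 'u4', 'u4', 'u4'],
})

# Named aggregation: total listens and distinct users per city
per_city = toy.groupby('city')['user_id'].agg(listens='count', users='nunique')
per_city['listens_per_user'] = per_city['listens'] / per_city['users']
print(per_city)
```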

Next, we will group the data by day of the week and count the listens for Monday, Wednesday, and Friday. The data contains information only about listens on these days.

In [19]:
# Counting the number of listens for each of the three days
df.groupby('day')['time'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: time, dtype: int64

On average, users from both cities are less active on Wednesdays. However, the pattern may change when each city is analyzed individually.

Now, let's write a function that will combine both of these calculations.

We will create a function called `number_tracks()` that will count the listens for a specific day and city. It will require two parameters:

* day of the week,
* city name.

In the function, we will filter the original table to get the rows where:

* the `day` column equals the day parameter, and
* the `city` column equals the city parameter.

This is done with logical (boolean) indexing: both conditions are combined into a single mask with the `&` operator.

Then, we will count the values in the `user_id` column of the resulting table. The result will be stored in a new variable, which will be returned by the function.

In [20]:
def number_tracks(day, city):
    track_list = df[(df['day'] == day) & (df['city'] == city)]
    track_list_count = track_list['user_id'].count()
    return track_list_count 

In [21]:
# The number of listens in Moscow on Mondays
number_tracks('Monday', 'Moscow')

15740

In [22]:
# The number of listens in Saint Petersburg on Mondays
number_tracks('Monday', 'Saint-Petersburg')

5614

In [23]:
# The number of listens in Moscow on Wednesdays
number_tracks('Wednesday','Moscow')

11056

In [24]:
# The number of listens in Saint Petersburg on Wednesdays
number_tracks('Wednesday', 'Saint-Petersburg')

7003

In [25]:
# The number of listens in Moscow on Fridays
number_tracks('Friday','Moscow')

15945

In [26]:
# The number of listens in Saint Petersburg on Fridays
number_tracks('Friday','Saint-Petersburg')

5895

Let's create a table using the `pd.DataFrame` constructor, where:

* Column names are `['city', 'monday', 'wednesday', 'friday']`
* The data consists of the results obtained using the `number_tracks` function.

In [27]:
# Table with results
columns = ['city', 'monday', 'wednesday', 'friday']
data = [
    ['Moscow', 15740, 11056, 15945],
    ['Saint-Petersburg', 5614, 7003, 5895]
]
user_behavior = pd.DataFrame(data=data, columns=columns)
display(user_behavior)

Unnamed: 0,city,monday,wednesday,friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
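
Typing the six numbers by hand is error-prone. As a hedged alternative, `pd.crosstab` can build the same city-by-day table directly from the `city` and `day` columns. The sketch uses a made-up five-row frame rather than the real data.

```python
import pandas as pd

# Made-up listens, one row per listen
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# Count rows for every (city, day) combination
counts = pd.crosstab(toy['city'], toy['day'])

# crosstab sorts columns alphabetically, so restore the weekday order
counts = counts.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(counts)
```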


**Intermediate Conclusions**

The data reveals differences in user behavior:

- In Moscow, the peak of listens occurs on Monday and Friday, with a noticeable drop on Wednesday.
- In contrast, in Saint Petersburg, users listen to music more on Wednesdays, with Monday and Friday showing almost equal, lower levels of activity compared to Wednesday.

Thus, the data supports the first hypothesis.
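
Because Moscow has far more listens overall, the two cities are easier to compare when each day is expressed as a share of that city's total. The sketch below reuses the six counts from the table above. The `shares` name and the choice of three decimals are illustrative.

```python
import pandas as pd

# Counts taken from the results table above
user_behavior = pd.DataFrame(
    {'monday': [15740, 5614], 'wednesday': [11056, 7003], 'friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide every row by its own total so that each row sums to 1
shares = user_behavior.div(user_behavior.sum(axis=1), axis=0).round(3)
print(shares)
```

In relative terms, Wednesday accounts for about 26% of Moscow listens but about 38% of Saint Petersburg listens, which matches the pattern described above.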

### Music at the Beginning and End of the Week

According to the second hypothesis, on Monday mornings, certain genres dominate in Moscow, while different genres are more popular in Saint Petersburg. Similarly, on Friday evenings, distinct genres prevail depending on the city.

We will store the data tables in two variables:

* For Moscow — in `moscow_general`;
* For Saint Petersburg — in `spb_general`.

In [28]:
# Retrieve the moscow_general table from the rows of the df table 
# where the value in the 'city' column is 'Moscow'
moscow_general = df[df['city'] == 'Moscow']

In [29]:
# Retrieve the spb_general table from the rows of the df table 
# where the value in the 'city' column is 'Saint-Petersburg'
spb_general = df[df['city'] == 'Saint-Petersburg']

We will create the `genre_weekday()` function with four parameters:

* a data table (dataframe) with the data,
* the day of the week,
* the start timestamp in the 'hh:mm' format,
* the end timestamp in the 'hh:mm' format.

The function should return information about the top 10 genres of the tracks listened to on the specified day, within the time range defined by the two timestamps.

In [30]:
def genre_weekday(table, day, time1, time2):
    genre_df = table[(table['day'] == day) & (table['time'] > time1) & (table['time'] < time2)]
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending = False)
    return genre_df_sorted.head(10)
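
The function compares the `time` column with the boundaries as plain strings. This works because zero-padded `hh:mm:ss` values sort in chronological order, but the boundary behaviour is worth checking. A minimal sketch on made-up timestamps:

```python
import pandas as pd

# Made-up timestamps around the 07:00 and 11:00 boundaries
times = pd.Series(['06:59:59', '07:00:00', '08:37:09', '10:59:59', '11:00:00'])

# Same comparison as in genre_weekday(): strict inequalities on strings
mask = (times > '07:00') & (times < '11:00')
print(times[mask].tolist())
```

A longer string that starts with the shorter one compares as greater, so `'07:00:00'` passes `> '07:00'` while `'11:00:00'` fails `< '11:00'`. In effect the window includes the start second and excludes the end second.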

Let's compare the results of the `genre_weekday()` function for Moscow and Saint Petersburg on Monday morning (from 7:00 to 11:00) and Friday evening (from 17:00 to 23:00):

In [31]:
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64

In [32]:
# Calling the function for Monday morning in Saint Petersburg
genre_weekday(spb_general, 'Monday', '07:00', '11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64

In [33]:
# Calling the function for Friday evening in Moscow
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: user_id, dtype: int64

In [34]:
# Calling the function for Friday evening in Saint Petersburg
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64

**Intermediate Conclusions**

When comparing the top 10 genres on Monday morning, we can draw the following conclusions:

1. Music preferences in Moscow and Saint Petersburg are quite similar. The only difference is that the Moscow ranking includes the "world" genre, while the Saint Petersburg ranking features jazz and classical music.
2. In Moscow, the number of missing values was so significant that the placeholder `'unknown'` secured a spot as the tenth most popular genre. This indicates that missing values make up a substantial portion of the data, threatening the reliability of the study.

Friday evening does not alter this pattern. Some genres rise slightly, others fall, but overall the top 10 remains largely unchanged.

Thus, the second hypothesis is only partially confirmed:

* Users listen to similar music at the beginning and end of the week.
* The difference between Moscow and Saint Petersburg is not very pronounced. Moscow leans more toward Russian pop music, while Saint Petersburg favors jazz.

However, the missing data casts doubt on this result. In Moscow, the number of missing values is so high that the top 10 could look different if the missing genre data were available.
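
The comparison of the top 10 lists can be checked mechanically with set operations. The lists below are copied from the four outputs above. The variable names are illustrative.

```python
# Top 10 genres copied from the genre_weekday() outputs above
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']
moscow_friday = ['pop', 'rock', 'dance', 'electronic', 'hiphop',
                 'world', 'ruspop', 'alternative', 'classical', 'rusrap']
spb_friday = ['pop', 'electronic', 'rock', 'dance', 'hiphop',
              'alternative', 'jazz', 'classical', 'rusrap', 'world']

def only_in_first(first, second):
    """Genres that appear in the first ranking but not in the second."""
    return sorted(set(first) - set(second))

print('Monday, Moscow only:', only_in_first(moscow_monday, spb_monday))
print('Monday, SPb only:   ', only_in_first(spb_monday, moscow_monday))
print('Friday, Moscow only:', only_in_first(moscow_friday, spb_friday))
print('Friday, SPb only:   ', only_in_first(spb_friday, moscow_friday))
```

On Monday morning the lists share eight of ten genres. On Friday evening they share nine, differing only in `ruspop` (Moscow) and `jazz` (Saint Petersburg).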

### Genre Preferences in Moscow and Saint Petersburg

**Hypothesis**: Saint Petersburg is the rap capital, where this genre is listened to more frequently than in Moscow. Meanwhile, Moscow, a city of contrasts, still predominantly listens to pop music.

Group the `moscow_general` table by genre and count the number of track listens for each genre using the `count()` method. Then, sort the result in descending order and save it in the `moscow_genres` table.

In [35]:
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [36]:
# View the first 10 rows of moscow_genres
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

Now let's do the same for Saint Petersburg.

We will group the `spb_general` table by genre, count the track listens for each genre, sort the results in descending order, and store the result in the `spb_genres` table.

In [37]:
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending = False)

In [38]:
# View the first 10 rows of spb_genres
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Intermediate Conclusions**

The hypothesis was partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis suggested. Moreover, a closely related genre—Russian pop music—also appears in the top 10 genres.
* Contrary to expectations, rap is equally popular in both Moscow and Saint Petersburg.
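
Raw counts cannot be compared directly because Moscow has more than twice as many listens. A hedged way to check the "equally popular" claim is to convert the counts above into shares of each city's total (42741 listens in Moscow and 18512 in Saint Petersburg, from the first count in the hypothesis testing section). The selection of three genres is illustrative.

```python
import pandas as pd

# Genre counts copied from the moscow_genres and spb_genres outputs above
counts = pd.DataFrame(
    {'Moscow': [5892, 2096, 1161], 'Saint-Petersburg': [2431, 960, 564]},
    index=['pop', 'hiphop', 'rusrap'],
)

# Total listens per city, from the groupby('city') output
totals = pd.Series({'Moscow': 42741, 'Saint-Petersburg': 18512})

# Dividing a DataFrame by a Series aligns the Series index with the column names
shares = (counts / totals).round(3)
print(shares)
```

The shares come out close in both cities: roughly 14% versus 13% for pop, and about 5% each for hiphop. This is consistent with the conclusion that the two cities have similar tastes.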


## Research Findings

We tested three hypotheses and found the following:

1. **The Day of the Week Affects User Activity Differently in Moscow and Saint Petersburg.**
The first hypothesis was fully confirmed.


2. **Musical Preferences Do Not Change Significantly Throughout the Week—Whether in Moscow or Saint Petersburg.**
Small differences are noticeable early in the week, specifically on Mondays:
* In Moscow, users listen to "world" music,
* In Saint Petersburg, jazz and classical music are more popular.

Thus, the second hypothesis was only partially confirmed. This result could have been different if the data had not contained missing values.

3. **User Preferences in Moscow and Saint Petersburg Have More in Common Than Differences.**

Contrary to expectations, the genre preferences in Saint Petersburg resemble those of Moscow.

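The third hypothesis was only partially confirmed: pop does lead in Moscow, but it leads in Saint Petersburg as well, and rap is no more popular there than in Moscow. If any differences in preferences exist, they are not noticeable among the majority of users.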