# Music and the Cities

An analysis of user behavior on the [Yandex Music](https://music.yandex.ru/) streaming service, comparing the music preferences of users in Moscow and St. Petersburg.

**Project status:** ✅ completed, reviewed.

<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Project-Overview" data-toc-modified-id="Project-Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Project Overview</a></span></li><li><span><a href="#Project-Summary" data-toc-modified-id="Project-Summary-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Project Summary</a></span></li><li><span><a href="#Outline" data-toc-modified-id="Outline-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Outline</a></span></li><li><span><a href="#Reading-the-Data-set" data-toc-modified-id="Reading-the-Data-set-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Reading the Data set</a></span></li><li><span><a href="#Data-Preparation:-Cleaning-and-Formatting" data-toc-modified-id="Data-Preparation:-Cleaning-and-Formatting-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Data Preparation: Cleaning and Formatting</a></span></li><li><span><a href="#Hypothesis-testing" data-toc-modified-id="Hypothesis-testing-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Hypothesis testing</a></span><ul class="toc-item"><li><span><a href="#Hypothesis-1" data-toc-modified-id="Hypothesis-1-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Hypothesis 1</a></span></li><li><span><a href="#Hypothesis-2" data-toc-modified-id="Hypothesis-2-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>Hypothesis 2</a></span></li><li><span><a href="#Hypothesis-3" data-toc-modified-id="Hypothesis-3-6.3"><span class="toc-item-num">6.3&nbsp;&nbsp;</span>Hypothesis 3</a></span></li></ul></li><li><span><a href="#Research-Findings" data-toc-modified-id="Research-Findings-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Research Findings</a></span></li><li><span><a href="#Project-Completion-Checklist" data-toc-modified-id="Project-Completion-Checklist-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Project Completion Checklist</a></span></li></ul></div>

## Project Overview

**The goal of this project** was to test three hypotheses:
1. User activity depends on the day of the week, with variations between Moscow and St. Petersburg.
1. Different music genres prevail in Moscow and St. Petersburg on Monday mornings and Friday evenings.
1. Moscow and St. Petersburg exhibit distinct music genre preferences, with Moscow favoring pop music and St. Petersburg favoring Russian rap.

**The research objectives** were as follows:

1. To check the initial data for errors and assess their impact on the study.
1. To determine the feasibility of correcting the most critical data errors.
1. To verify the proposed hypotheses.

**The data set used for this analysis** was taken from the file `yandex_music_project.csv`, which contains the following data regarding user behavior on the Yandex Music platform:

- `userID` — user identifier;
- `Track` — track name;
- `artist` — artist's name;
- `genre` — genre name;
- `City` — user's city;
- `time` — start time of listening;
- `Day` — day of the week.

**Skills and tools used:** `Python`, `pandas`, `.groupby()`, `.pivot_table()`.

[Top of this section](#Project-Overview) | [Project Contents](#Table-of-Contents)

## Project Summary

- Out of the three investigated hypotheses, the first one was fully confirmed, while the second and third were partially supported.


- The results indicate that the preferences of Moscow and St. Petersburg users have more in common than not. Where differences exist, they are subtle and do not show up among the majority of users.


- The significant share of missing values in the `genre` field of the initial data casts doubt on the conclusions for the second and third hypotheses.


**Future Work**

Data from a single service may not be representative of a city's entire population. Gathering data from other streaming platforms and applying formal statistical hypothesis tests would show how reliable these findings are.
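
As a rough illustration of what such a statistical check might look like, the sketch below runs a chi-square test of independence on the city-by-day listen counts reported in the Hypothesis 1 section. It is a sketch under assumptions, not part of the original analysis: it treats every listen as an independent observation, which is questionable because one user can contribute many listens, so the result should be read as indicative only.

```python
import math

# City-by-day listen counts taken from the Hypothesis 1 pivot table.
observed = {
    'Moscow': {'Monday': 15740, 'Wednesday': 11056, 'Friday': 15945},
    'Saint-Petersburg': {'Monday': 5614, 'Wednesday': 7003, 'Friday': 5895},
}
days = ['Monday', 'Wednesday', 'Friday']

row_totals = {city: sum(row.values()) for city, row in observed.items()}
col_totals = {day: sum(observed[city][day] for city in observed) for day in days}
grand_total = sum(row_totals.values())

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells,
# where "expected" is the count implied if city and day were independent.
chi2 = 0.0
for city in observed:
    for day in days:
        expected = row_totals[city] * col_totals[day] / grand_total
        chi2 += (observed[city][day] - expected) ** 2 / expected

# The table has (2 - 1) * (3 - 1) = 2 degrees of freedom. For exactly
# 2 degrees of freedom the chi-square survival function is exp(-x / 2).
p_value = math.exp(-chi2 / 2)
print(f'chi2 = {chi2:.1f}, p = {p_value:.3g}')
```

A statistic above 5.991 (the 5% critical value for 2 degrees of freedom) would suggest that the distribution of listens across days differs between the two cities, subject to the independence caveat above.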

[Top of this section](#Project-Summary) | [Project Contents](#Table-of-Contents)

## Outline

1. **Reading the Data set**.
    - Loading the `pandas` library and the dataset from the `yandex_music_project.csv` file.
    - Initial data examination using the `head()` and `info()` methods.
</br></br>
2. **Data Preparation: Cleaning and Formatting**.
    - Formatting column names with the `df.columns` attribute and the `rename()` method.
    - Handling explicit duplicates with `drop_duplicates()`.
    - Identifying and replacing implicit duplicates with `sort_values()`, `unique()`, and a custom function, `replace_wrong_genres(wrong_genres, correct_genre)`.
</br></br>
3. **Hypothesis Testing**
    - Conducting investigations to achieve project goals.
</br></br>
4. **Research Findings**
    - Summarizing key findings concisely.

[Top of this section](#Outline) | [Project Contents](#Table-of-Contents)

## Reading the Data set

In [1]:
# Import the pandas library as 'pd'
import pandas as pd

# Read the CSV file into the 'df' DataFrame
df = pd.read_csv('datasets/yandex_music_project.csv')

# Display the first 10 rows of 'df'
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [2]:
# Display 'df' DataFrame info.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


In [3]:
# Count missing values in each column.
df.isna().sum()

  userID       0
Track       1231
artist      7203
genre       1198
  City         0
time           0
Day            0
dtype: int64

**Dataset Exploration Summary**

- The dataset contains a total of 65,079 records. This volume of data is sufficient for hypothesis testing.


- The table consists of seven columns, all of which have the `object` data type.


- Column names have style inconsistencies:
    1. A mix of lowercase and uppercase letters.
    1. Leading and trailing spaces in some column names (`  userID`, `  City  `).
    1. Column names lack clarity in describing their content.
</br></br>
- There are missing values in the dataset.
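
Padding spaces in headers are easy to miss in a rendered table. The sketch below shows one way to expose them and apply a first normalization pass. The empty toy frame imitates the raw headers reported by `info()` above, and the `normalize` helper is a hypothetical name introduced for illustration, not part of the project.

```python
import pandas as pd

# Toy frame whose headers imitate the raw dataset's (assumed for illustration).
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre',
                            '  City  ', 'time', 'Day'])

# A plain Python list prints each name in quotes, so padding becomes visible.
print(toy.columns.tolist())

# Hypothetical helper: trim surrounding spaces and lowercase the name.
def normalize(name):
    return name.strip().lower()

cleaned = [normalize(name) for name in toy.columns]
print(cleaned)
```

Stripping and lowercasing turns `  userID` into `userid`, not `user_id`, so a snake_case name still needs an explicit `rename()`.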

[Top of this section](#Reading-the-Data-set) | [Project Contents](#Table-of-Contents)

## Data Preparation: Cleaning and Formatting

In [4]:
# 1. Rename columns to improve style and clarity.
df = df.rename(columns={'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

In [5]:
# 2. Replace missing values in specified columns with 'unknown'.
columns_to_replace = ['track', 'artist', 'genre']

for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

In [6]:
# 3. Count the number of duplicate rows in the DataFrame 'df'.
df.duplicated().sum()

3826

In [7]:
# 4. Remove duplicate rows from the DataFrame 'df' and reset the index.
df = df.drop_duplicates().reset_index(drop=True)

df.duplicated().sum()

0

In [8]:
# 5. Identify and replace implicit duplicates

# To identify implicit duplicates, list the unique values in the 'genre' column in ascending order.
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Implicit duplicates found:
- *hip*, *hop*, *hip-hop*: variant spellings of the genre *hiphop*.
- *ïîï*: the Russian genre name "поп" (pop), stored in the Windows-1251 encoding and misread as Latin-1 (identified with [this universal decoder](https://2cyr.com/decode/?lang=ru)).
- *электроника*: the genre "electronic" written in Russian.
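
The garbled value *ïîï* can also be checked without an external decoder. The sketch below assumes the damage happened in the usual way, with Windows-1251 bytes decoded as Latin-1, and reverses that step using only Python's built-in codecs.

```python
# 'ïîï' written with escapes so the example does not depend on file encoding.
garbled = '\u00ef\u00ee\u00ef'

# Reverse the mistake: recover the raw bytes via Latin-1,
# then decode them with the encoding they were written in.
restored = garbled.encode('latin-1').decode('cp1251')
print(restored)  # поп
```

The three bytes `EF EE EF` are "п", "о", "п" in Windows-1251, which gives the Russian word for "pop".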

To replace implicit duplicates, we will create a function `replace_wrong_genres()` with two parameters:
- `wrong_genres`: a list of duplicates;
- `correct_genre`: a string with the correct value.

The function corrects the `genre` column of `df` by replacing each value in the `wrong_genres` list with `correct_genre`.

In [9]:
# Function for replacing implicit duplicates
def replace_wrong_genres(wrong_genres, correct_genre):
    for wrong_genre in wrong_genres:
        df['genre'] = df['genre'].replace(wrong_genre, correct_genre)
        
# Dictionary of implicit duplicates
genres_to_replace = {
    'hiphop' : ['hip', 'hop', 'hip-hop'],
    'pop' : ['ïîï'],
    'electronic' : ['электроника']
}

# Loop for replacing implicit duplicates
for correct_value in genres_to_replace:
    replace_wrong_genres(genres_to_replace[correct_value], correct_value)
    
# Check implicit duplicates replacement
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

**Data Preparation Summary**
- Adjusted column headers to snake_case.
- Filled missing values with 'unknown'.
- Removed complete duplicates.
- Addressed implicit duplicates.
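
The `replace_wrong_genres()` function above modifies the global `df`. A possible variant, sketched here on toy data, takes the column as an argument and flattens the dictionary into a single wrong-to-correct mapping. The name `unify_genres` and the sample values are assumptions made for illustration.

```python
import pandas as pd

# Same correct -> wrong-variants dictionary as in the notebook cell.
genres_to_replace = {
    'hiphop': ['hip', 'hop', 'hip-hop'],
    'pop': ['ïîï'],
    'electronic': ['электроника'],
}

def unify_genres(genres, replacements):
    """Return a copy of `genres` with every wrong variant replaced."""
    # Flatten to {wrong: correct} so a single replace() call covers all cases.
    mapping = {wrong: correct
               for correct, wrongs in replacements.items()
               for wrong in wrongs}
    return genres.replace(mapping)

# Toy column (assumed values) mixing correct names and implicit duplicates.
toy = pd.Series(['hip', 'pop', 'электроника', 'rock', 'hip-hop'])
unified = unify_genres(toy, genres_to_replace)
print(unified.tolist())
```

Because the function returns a new Series instead of writing to a global, it can be tested on small inputs before being applied to the full `genre` column.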

[Top of this section](#Data-Preparation:-Cleaning-and-Formatting) | [Project Contents](#Table-of-Contents)

## Hypothesis testing

### Hypothesis 1
User activity depends on the day of the week, with variations between Moscow and St. Petersburg.

In [10]:
# Create a pivot table to analyze user activity by city and day.
pivot_table = df.pivot_table(index=['city'],
                             columns='day',
                             values='user_id',
                             aggfunc='count',
                             fill_value=0,
                             margins=True,
                             margins_name='Total')

# Drop the 'day' label from the column axis so the table prints with a flat header.
pivot_table = pivot_table.rename_axis(None, axis=1)

# Reset the index of the pivot table
pivot_table.reset_index(inplace=True)

# Select specific columns in the pivot table for analysis.
pivot_table = pivot_table[['city', 'Monday', 'Wednesday', 'Friday', 'Total']]

# Display the modified pivot table.
pivot_table

| | city | Monday | Wednesday | Friday | Total |
|---|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 | 42741 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 | 18512 |
| 2 | Total | 21354 | 18059 | 21840 | 61253 |


**Conclusions**
- In Moscow, listening peaks on Monday and Friday, with a noticeable dip on Wednesday.
- In St. Petersburg the pattern is reversed: listening peaks on Wednesday, while activity on Monday and Friday is lower and roughly equal.

✅ The first hypothesis has been confirmed.
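
Because Moscow has more than twice as many listens as St. Petersburg, raw counts are hard to compare across cities. One way to put both on the same scale, sketched below with the counts from the pivot table above, is to divide each row by its own total. The variable names are illustrative.

```python
import pandas as pd

# Listen counts copied from the pivot table above.
counts = pd.DataFrame(
    {'Monday': [15740, 5614],
     'Wednesday': [11056, 7003],
     'Friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# Divide each row by its own total so every city's shares sum to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

On these figures Wednesday accounts for roughly 26% of listens in Moscow but about 38% in St. Petersburg, which matches the pattern described in the conclusions.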

[Top of this section](#Hypothesis-1) | [Project Contents](#Table-of-Contents)

### Hypothesis 2

Different music genres prevail in Moscow and St. Petersburg on Monday mornings and Friday evenings.

In [11]:
# 1. Splitting the dataset into two samples:
# `moscow_general` — data for Moscow, and
# `spb_general` — data for St. Petersburg.

moscow_general = df[df['city'] == 'Moscow']
spb_general = df[df['city'] == 'Saint-Petersburg']

# 2. Function that will generate a top-10 ranking of track genres
# listened to on a specified day within a given time interval.
def genre_weekday(table, day, time1, time2):
    genre_df = table[table['day'] == day]
    genre_df = genre_df[genre_df['time'] > time1]
    genre_df = genre_df[genre_df['time'] < time2]
    genre_df_count = genre_df.groupby('genre')['user_id'].count()
    genre_df_sorted = genre_df_count.sort_values(ascending=False)
    return genre_df_sorted.head(10)

# 3. Defining days of the week and time intervals for analysis. 
dt_interval = [('Monday', '07:00', '11:00'), ('Friday', '17:00', '23:00')]

# 4. Loop for ranking of track genres
for day_time in dt_interval:
    day, start, end = day_time
    for city_df in [spb_general, moscow_general]:
        print('City: {} Day: {} Time: {} - {}'.format(city_df['city'].unique()[0], day, start, end))
        print(genre_weekday(city_df, day, start, end))
        print('\n')

City: Saint-Petersburg Day: Monday Time: 07:00 - 11:00
genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: user_id, dtype: int64


City: Moscow Day: Monday Time: 07:00 - 11:00
genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: user_id, dtype: int64


City: Saint-Petersburg Day: Friday Time: 17:00 - 23:00
genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: user_id, dtype: int64


City: Moscow Day: Friday Time: 17:00 - 23:00
genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world    

**Conclusions**
- On Monday mornings, musical preferences in Moscow and St. Petersburg differ only slightly: Moscow's top 10 includes "world", while St. Petersburg's includes "jazz" and "classical".
- Missing values are frequent enough in the Moscow data that `unknown` takes tenth place in the Monday-morning ranking, which raises concerns about data credibility.
- Friday evening does not significantly alter this pattern; the top 10 genres remain largely the same.

🟨 The second hypothesis has been partially confirmed.
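
To judge how much the `unknown` placeholder weighs on a ranking, it can help to compute its share of listens per city. The sketch below does this on a small toy frame. The rows are invented for illustration and do not come from the dataset.

```python
import pandas as pd

# Toy listens (assumed values); 'unknown' marks a filled-in missing genre.
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'unknown',
              'pop', 'unknown', 'jazz', 'rock'],
})

# The mean of a boolean Series is the fraction of True values,
# so grouping by city gives the share of 'unknown' listens per city.
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean()
print(unknown_share)
```

Applied to the real `df`, the same expression would show whether missing genres are concentrated in one city or spread evenly.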

[Top of this section](#Hypothesis-2) | [Project Contents](#Table-of-Contents)

### Hypothesis 3
Moscow and St. Petersburg exhibit distinct music genre preferences, with Moscow favoring pop music and St. Petersburg favoring Russian rap.

In [12]:
# Grouping Moscow's data by track genres in descending order of listening counts.
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

# Displaying the top 10 genres in Moscow.
moscow_genres.head(10)

genre
pop            5893
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [13]:
# Grouping St. Petersburg's data by track genres in descending order of listening counts.
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

# Displaying the top 10 genres in St. Petersburg.
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1737
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**
- Pop is the most popular genre in Moscow, as hypothesized, and Russian pop (`ruspop`) also appears in the city's top 10.
- Contrary to expectations, pop also leads in St. Petersburg, and Russian rap (`rusrap`) is about equally popular in both cities: it ranks 10th in Moscow and 8th in St. Petersburg.

🟨 The third hypothesis has been partially confirmed.
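
Ranks alone do not show how close the two cities are. A minimal sketch of a share-based comparison follows, using the genre counts from the two rankings above and the city totals from the Hypothesis 1 table. The dictionary layout is an illustrative choice.

```python
# City totals from the Hypothesis 1 pivot table.
total_listens = {'Moscow': 42741, 'Saint-Petersburg': 18512}

# Genre counts from the two top-10 rankings above.
genre_counts = {
    'Moscow': {'pop': 5893, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'rusrap': 564},
}

# Share of each genre in its city's total number of listens.
shares = {
    city: {genre: count / total_listens[city] for genre, count in counts.items()}
    for city, counts in genre_counts.items()
}

for city, city_shares in shares.items():
    for genre, share in city_shares.items():
        print(f'{city:17} {genre:7} {share:.1%}')
```

On these figures Russian rap takes about 2.7% of listens in Moscow and about 3.0% in St. Petersburg, a gap of well under one percentage point.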

[Top of this section](#Hypothesis-3) | [Project Contents](#Table-of-Contents)

## Research Findings

We examined three hypotheses and established the following:

1. The day of the week has varying effects on user activity in Moscow and St. Petersburg, fully confirming the first hypothesis.


2. Musical preferences remain relatively stable throughout the week, whether in Moscow or St. Petersburg. Minor differences emerge at the beginning of the week, on Mondays:
   - Moscow leans toward the "world" music genre.
   - St. Petersburg prefers jazz and classical music.
</br></br>
   Therefore, the second hypothesis was only partially confirmed. The result might have been different without the gaps in the data.


3. The musical tastes of Moscow and St. Petersburg users show more similarities than differences: genre preferences in St. Petersburg closely resemble those in Moscow. The third hypothesis was only partially confirmed. Pop leads in Moscow as expected, but it leads in St. Petersburg as well, and any differences in preferences that do exist are not noticeable among the majority of users.

These findings provide insights into the dynamics of user behavior and music preferences in the two cities, highlighting both similarities and subtle distinctions.

[Top of this section](#Research-Findings) | [Project Contents](#Table-of-Contents)

## Project Completion Checklist

**Step 1. Data Loading and Initial Exploration**
   - [x] Source data file opened
   - [x] File examined (first rows displayed, `info()` method used)
   - [x] Decision made regarding the need for data preprocessing

</br></br>
**Step 2. Data Preprocessing**
   - [x] Column headers standardized
   - [x] Missing data identified and filled with placeholder values
   - [x] Complete duplicates removed
   - [x] Implicit duplicates addressed  

</br></br>
**Step 3. Hypothesis Testing**
   - [x] First hypothesis tested
   - [x] Second hypothesis tested
   - [x] Third hypothesis tested
   - [x] Overall conclusion drawn for the entire study

</br></br>
[Project Contents](#Table-of-Contents)