# User Preferences Across Cities on Yandex.Music

We will use data from the Yandex.Music service to compare the listening preferences of users from Moscow and St. Petersburg.

There are three major hypotheses to check:

1. User activity changes depending on the day of the week, with the changes being different in Moscow and St. Petersburg.
2. Users show different genre preferences on Monday mornings in Moscow and St. Petersburg. They also have different genre preferences on Friday evenings.
3. Genre preferences overall are different as well. Muscovites prefer pop, while Petersburgians prefer Russian rap.

The main purpose of this exercise is to familiarize ourselves with using Pandas for basic data analysis.

## Data Overview

We start by importing our libraries.

In [1]:
import pandas as pd

Then we import the data into a Pandas dataframe.

In [2]:
df = pd.read_csv('D://Documents/Courses/Yandex.Practicum - Data Science/Projects/1 - Music Tastes Between Cities/yandex_music_project.csv')

Let's take a look at the results.

In [3]:
df.head(10)

| | userID | Track | artist | genre | City | time | Day |
|---|---|---|---|---|---|---|---|
| 0 | FFB692EC | Kamigata To Boots | The Mass Missile | rock | Saint-Petersburg | 20:28:33 | Wednesday |
| 1 | 55204538 | Delayed Because of Accident | Andreas Rönnberg | rock | Moscow | 14:07:09 | Friday |
| 2 | 20EC38 | Funiculì funiculà | Mario Lanza | pop | Saint-Petersburg | 20:58:07 | Wednesday |
| 3 | A3DD03C9 | Dragons in the Sunset | Fire + Ice | folk | Saint-Petersburg | 08:37:09 | Monday |
| 4 | E2DC1FAE | Soul People | Space Echo | dance | Moscow | 08:34:34 | Monday |
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 6 | 4CB90AA5 | True | Roman Messer | dance | Moscow | 13:00:07 | Wednesday |
| 7 | F03E1C1F | Feeling This Way | Polina Griffith | dance | Moscow | 20:47:49 | Wednesday |
| 8 | 8FA1D3BE | И вновь продолжается бой | NaN | ruspop | Moscow | 09:17:40 | Friday |
| 9 | E772D5C0 | Pessimist | NaN | dance | Saint-Petersburg | 21:20:49 | Wednesday |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


Our dataframe contains a list of tracks played, with the following information available about them:

* `userID` - the ID of the user;
* `Track` - the track name;
* `artist` - the artist's name;
* `genre` - the genre name;
* `City` - the name of the city the user is in;
* `time` - the time of day the track was played;
* `Day` - the day of the week the track was played.

The column names themselves are not all properly formatted:

1. Uppercase letters are mixed with lowercase ones.
2. Spaces are present.
3. snake_case isn't used.

The non-null counts differ across columns, which means that some values are missing and we have to handle them.

### Conclusions

Each dataframe row describes one track played. Some of the columns describe track properties: track name, artist name, and genre. The rest of the columns describe the user and their activity: user ID, where the user was located when the track was played, and when the track was played.

The data itself is limited: we only have data for Mondays, Wednesdays, and Fridays.
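One way to confirm which days are present is to count plays per day. The snippet below is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the real dataset); it uses the raw column name `Day`, since the columns have not been renamed at this point.

```python
import pandas as pd

# toy stand-in for the raw data (hypothetical values)
toy = pd.DataFrame({'Day': ['Wednesday', 'Friday', 'Monday', 'Monday', 'Friday']})

# count plays per day; days with no plays simply do not appear in the result
plays_per_day = toy['Day'].value_counts()
print(sorted(plays_per_day.index))
```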

At the moment, it seems that we have sufficient data to test our hypotheses. However, before moving on, we need to do pre-processing: fix the column names, and deal with the missing data.

## Data Preprocessing

### Column Name Styles

Let's take a look at the full names of the columns.

In [5]:
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

Several column names are fine according to style conventions: 
* `artist`
* `genre`
* `time`

The rest of them need to be fixed according to the following rules:

* for multiple words, use snake_case
* avoid using uppercase
* remove spaces

This means we need to make the following changes:

* `'  userID'` → `'user_id'`;
* `'Track'` → `'track'`;
* `'  City  '` → `'city'`;
* `'Day'` → `'day'`.

In [6]:
df = df.rename(columns = {'  userID': 'user_id', 'Track': 'track', '  City  ': 'city', 'Day': 'day'})

Let's check the results by taking a look at column names again.

In [7]:
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')

#### Conclusion

We've changed the column names according to style standards.
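With only four columns to fix, an explicit mapping is the simplest option. For wider tables, the same three rules could be applied programmatically. The following is a rough sketch, not the method used above: `to_snake_case` is a hypothetical helper, and the regular expression assumes that word boundaries are marked by a lowercase letter or digit followed by an uppercase letter.

```python
import re

import pandas as pd


def to_snake_case(name):
    """Hypothetical helper: trim spaces, split camelCase words, lowercase."""
    name = name.strip()
    # insert an underscore at every lowercase/digit -> uppercase boundary
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()


# empty toy frame with the same raw column names as our data
toy = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])

# rename() accepts a function and applies it to every column name
toy = toy.rename(columns=to_snake_case)
print(list(toy.columns))
```

The explicit dictionary remains easier to review, so a helper like this is mainly useful when there are too many columns to list by hand.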

### Missing Values

First, let's check how many missing entries we have in our dataframe.

In [8]:
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64

For the purposes of our hypothesis testing, not all missing values are important. For example, missing `track` and `artist` values will not affect our study at all, so we can replace them with `'unknown'`.

However, missing values in `genre` can confuse our comparison between Moscow and St. Petersburg music tastes. Ideally, we would contact the colleagues who gave us the data and try to restore as much data as we could, but that's not an option at the moment. Therefore, we need to replace these missing values with `'unknown'` and then check how much they affect our comparison.

In [9]:
#create a temporary list with the columns to replace n/a
columns_to_replace = ['track', 'artist', 'genre']

#fill the n/a values with 'unknown'
for column in columns_to_replace:
    df[column]=df[column].fillna('unknown')

Let's check the results.

In [10]:
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

#### Conclusion

Missing values in the `track`, `artist`, and `genre` columns were replaced with `'unknown'`. No further missing values were found.
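To see how much the `'unknown'` genres could affect the comparison between cities, one option is to compute their share of plays per city. This is a minimal sketch on a made-up toy frame (the values are assumptions for illustration only); on the real data, `toy` would be replaced with `df`.

```python
import pandas as pd

# toy stand-in for the cleaned data (hypothetical values)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg'],
    'genre': ['pop', 'unknown', 'rock', 'dance', 'unknown', 'pop'],
})

# the mean of a boolean mask is the fraction of True values,
# so grouping the mask by city gives the share of unknown genres per city
unknown_share = (toy['genre'] == 'unknown').groupby(toy['city']).mean().mul(100).round(1)
print(unknown_share)
```

If the shares turn out to be small and similar in both cities, the missing genres are unlikely to distort the comparison.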

### Duplicate Values

Let's check the number of obvious (fully identical) duplicate rows.

In [11]:
df.duplicated().sum()

3826

Excellent, let's drop these.

In [12]:
df = df.drop_duplicates().reset_index(drop=True)

Check results.

In [13]:
df.duplicated().sum()

0

Now we need to look through the `genre` column and see whether we have non-obvious duplicates there. Since the same genre name can be written in several ways, the count for a genre can appear smaller than it really is.

Let's take a look at a list of genre names that we have.

In [14]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '

Immediately we can see that there are non-obvious duplicates with `hiphop`:

* `hip`
* `hop`
* `hip-hop`

Let's write a function that will replace an incorrect genre name.

In [15]:
def replace_wrong_genre(line):
    
    #define the wrong genres and the correct genres
    wrong_genres = ['hip','hop','hip-hop']
    correct_genre = 'hiphop'
    
    #extract the genre from the line
    curr_genre = line['genre']
    
    #see if the current genre is present in the list of wrong genres, and return the correct genre if true
    if curr_genre in wrong_genres:
        return correct_genre
    
    #if not, return the current genre as is
    else:
        return curr_genre

And use the function to replace non-obvious duplicates for hiphop.

In [16]:
df['genre'] = df.apply(replace_wrong_genre, axis=1)

Let's check out the results.

In [17]:
df['genre'].sort_values().unique()

array(['acid', 'acoustic', 'action', 'adult', 'africa', 'afrikaans',
       'alternative', 'alternativepunk', 'ambient', 'americana',
       'animated', 'anime', 'arabesk', 'arabic', 'arena',
       'argentinetango', 'art', 'audiobook', 'author', 'avantgarde',
       'axé', 'baile', 'balkan', 'beats', 'bigroom', 'black', 'bluegrass',
       'blues', 'bollywood', 'bossa', 'brazilian', 'breakbeat', 'breaks',
       'broadway', 'cantautori', 'cantopop', 'canzone', 'caribbean',
       'caucasian', 'celtic', 'chamber', 'chanson', 'children', 'chill',
       'chinese', 'choral', 'christian', 'christmas', 'classical',
       'classicmetal', 'club', 'colombian', 'comedy', 'conjazz',
       'contemporary', 'country', 'cuban', 'dance', 'dancehall',
       'dancepop', 'dark', 'death', 'deep', 'deutschrock', 'deutschspr',
       'dirty', 'disco', 'dnb', 'documentary', 'downbeat', 'downtempo',
       'drum', 'dub', 'dubstep', 'eastern', 'easy', 'electronic',
       'electropop', 'emo', 'entehno', '
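As an alternative to the row-wise `apply`, the same replacement could be done directly on the column with `Series.replace`. This is a small sketch on a toy frame (made-up values), not a change to the analysis above.

```python
import pandas as pd

# toy column containing the variants we want to unify (hypothetical values)
toy = pd.DataFrame({'genre': ['hip', 'hop', 'hip-hop', 'hiphop', 'rock']})

# replace() with a list matches whole values only, so 'hiphop' and 'rock' stay as they are
toy['genre'] = toy['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')
print(toy['genre'].tolist())
```

This avoids calling a Python function once per row, which tends to matter on larger tables.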

The rest of the genres aren't particularly relevant to our hypotheses, apart from `'russian'`, since it could include Russian rap. Let's compare the sizes of those entries.

In [18]:
df[df['genre'] == 'russian']

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 555 | 5D1C77EF | Пусти меня ты мама | unknown | russian | Moscow | 21:05:08 | Monday |
| 831 | D8F36F8B | Я живу не унываю | Эдуард Хуснутдинов | russian | Saint-Petersburg | 08:24:46 | Friday |
| 1040 | DCE8BCFE | Надежды маленький оркестрик | Олег Погудин | russian | Moscow | 13:46:54 | Wednesday |
| 4760 | FAD58FB5 | Пара-па-парам | Stereopulse | russian | Moscow | 09:36:46 | Wednesday |
| 6393 | 8BC092B4 | Волчий гон | Михаил Борисов | russian | Moscow | 09:14:08 | Friday |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 57157 | 5201558A | Привет от Вороваек | Воровайки | russian | Moscow | 08:18:25 | Monday |
| 57985 | B0DCBF6C | Я по тебе скучаю | unknown | russian | Moscow | 13:20:34 | Friday |
| 59152 | F0E65B40 | Встречайте кореша | Запретка | russian | Moscow | 09:52:18 | Wednesday |
| 59157 | 7450A38F | Пироги | Юлия Морозова | russian | Moscow | 08:29:27 | Friday |


In [19]:
df[df['genre'] == 'rusrap']

| | user_id | track | artist | genre | city | time | day |
|---|---|---|---|---|---|---|---|
| 5 | 842029A1 | Преданная | IMPERVTOR | rusrap | Saint-Petersburg | 13:09:41 | Friday |
| 40 | 77979A66 | Верните в моду любовь | Hazard | rusrap | Saint-Petersburg | 08:45:43 | Monday |
| 100 | 6ED43067 | Сочи | unknown | rusrap | Moscow | 08:15:38 | Friday |
| 105 | 9A133297 | Я настроен серьёзно | unknown | rusrap | Saint-Petersburg | 14:32:56 | Friday |
| 107 | E4F17EF2 | Патефон | TM13 | rusrap | Moscow | 21:37:53 | Wednesday |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 61004 | A9F6AB4A | Наше лето | unknown | rusrap | Moscow | 13:22:21 | Wednesday |
| 61131 | 443EFE83 | Дорога | unknown | rusrap | Saint-Petersburg | 21:53:00 | Friday |
| 61183 | AF8479D7 | Necronomicon | unknown | rusrap | Saint-Petersburg | 20:22:43 | Monday |
| 61209 | 27047C8 | Револьвер | unknown | rusrap | Saint-Petersburg | 21:52:57 | Wednesday |


The `'russian'` entries obviously include `'shanson'` artists, like "Воровайки", and are also not very numerous (90 vs. 1725 entries). We can ignore these entries: there are too few of them to have a noticeable effect on our hypothesis testing.
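The entry counts can also be read directly, without displaying the filtered tables. A minimal sketch with made-up toy data (on the real data, `toy` would be `df`):

```python
import pandas as pd

# toy stand-in for the genre column (hypothetical values)
toy = pd.DataFrame({'genre': ['russian', 'rusrap', 'rusrap', 'pop', 'rusrap']})

# value_counts() returns the number of rows per genre
counts = toy['genre'].value_counts()
print(counts[['russian', 'rusrap']])
```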

#### Conclusion

Obvious duplicates were removed (3,826 lines). Non-obvious duplicates for `'hiphop'` in the `genre` column (`'hip'`, `'hop'`, `'hip-hop'`) were brought to the standard form.

Important note: other genres weren't fixed, since they're not likely to affect our hypotheses. 

## Hypothesis Testing

### Comparing the Behaviour of Muscovites and Petersburgians

Our first hypothesis is that users in Moscow and St. Petersburg have different music listening activities. Let's compare them across the three days we have data for: Monday, Wednesday, and Friday. 

In [20]:
df.groupby('city')['genre'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: genre, dtype: int64

We can see that more tracks were played in Moscow than in St. Petersburg. However, we can't assume that this means Muscovites listen to music more than Petersburgians: there may simply be more users in Moscow.

Let's now take a look at how the data is split across days.

In [21]:
df.groupby('day')['genre'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: genre, dtype: int64

We can see that there is a drop in listening on Wednesdays compared to Mondays and Fridays.

Now let's write a function that will return the number of tracks played for a given city and day.

In [22]:
def number_tracks(day, city):
    
    #create a temporary dataframe that only includes the right day and city
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    
    #return the total count of entries in our temporary dataframe
    return track_list['user_id'].count()

Let's use `number_tracks` to create a table that has the raw numbers across cities and days.

In [23]:
#the name for the columns in our display dataframe
column_names = ['city', 'Monday', 'Wednesday', 'Friday']

#generate the dataframe
display_data = {'city' : ['Moscow', 'Saint-Petersburg'],
               'Monday' : [number_tracks('Monday', 'Moscow'), number_tracks('Monday', 'Saint-Petersburg')],
               'Wednesday' : [number_tracks('Wednesday', 'Moscow'), number_tracks('Wednesday', 'Saint-Petersburg')],
               'Friday' : [number_tracks('Friday', 'Moscow'), number_tracks('Friday', 'Saint-Petersburg')]}
display_table = pd.DataFrame(display_data, columns = column_names)

#display it
display_table

| | city | Monday | Wednesday | Friday |
|---|---|---|---|---|
| 0 | Moscow | 15740 | 11056 | 15945 |
| 1 | Saint-Petersburg | 5614 | 7003 | 5895 |
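The same kind of city-by-day table could also be built in a single step with `pd.crosstab`. This is only a sketch on a toy frame with made-up rows, shown as an alternative to calling `number_tracks` for each cell.

```python
import pandas as pd

# toy stand-in for the cleaned data (hypothetical values)
toy = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow',
             'Saint-Petersburg', 'Saint-Petersburg', 'Moscow'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Friday', 'Wednesday'],
})

# crosstab counts rows for every (city, day) pair; missing pairs become 0
table = pd.crosstab(toy['city'], toy['day'])

# columns come out alphabetically, so reorder them chronologically
table = table[['Monday', 'Wednesday', 'Friday']]
print(table)
```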


However, this table isn't as clear as it could be for the client. Let's create one that will instead show each day as a percentage of the total plays per city.

In [24]:
#make a function to create a list of percentages per city
def daily_listens_as_percent_of_weekly_total(city):
    
    #create a blank list for the percentages
    daily_percentages = []
    
    #calculate the total number of plays in the given city
    total_listens = df[df['city'] == city]['user_id'].count()
    
    #do percentage calculations for each day
    for day in ['Monday', 'Wednesday', 'Friday']:
        #append the rounded percentage for each day to daily_percentages
        daily_percentages.append(round((number_tracks(day,city)/total_listens)*100,1))  
    
    #return the whole list
    return daily_percentages

In [25]:
#calculate the percentages for each city
moscow_daily_percentages = daily_listens_as_percent_of_weekly_total('Moscow')
spb_daily_percentages = daily_listens_as_percent_of_weekly_total('Saint-Petersburg')

In [26]:
#creating our columns for the dataframe
percentage_columns_list = []
for i in range(len(moscow_daily_percentages)):
    percentage_columns_list.append([moscow_daily_percentages[i],spb_daily_percentages[i]])
    
#creating the names for the columns
display_as_percentage_column_names = ['city', 'Monday (%)', 'Wednesday (%)', 'Friday (%)']

#getting the data for the dataframe
display_as_percentage_data = {'city' : ['Moscow', 'Saint-Petersburg'],
                             'Monday (%)' : percentage_columns_list[0],
                             'Wednesday (%)' : percentage_columns_list[1],
                             'Friday (%)' : percentage_columns_list[2]}

#putting the dataframe together and displaying it
display_as_percentage_table = pd.DataFrame(display_as_percentage_data, columns=display_as_percentage_column_names)
display_as_percentage_table

| | city | Monday (%) | Wednesday (%) | Friday (%) |
|---|---|---|---|---|
| 0 | Moscow | 36.8 | 25.9 | 37.3 |
| 1 | Saint-Petersburg | 30.3 | 37.8 | 31.8 |


Excellent, this way, we can see the percentage breakdown across days and cities.
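As a cross-check, the percentages can be recomputed from the raw counts reported earlier by dividing each row by its total. The sketch below hard-codes those counts into a small frame (the variable names are illustrative) instead of querying `df`.

```python
import pandas as pd

# raw play counts per city and day, copied from the counts table above
counts = pd.DataFrame(
    {'Monday': [15740, 5614], 'Wednesday': [11056, 7003], 'Friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# divide every row by its own total, then convert to rounded percentages
percentages = counts.div(counts.sum(axis=1), axis=0).mul(100).round(1)
print(percentages)
```

On the full dataframe, `pd.crosstab(df['city'], df['day'], normalize='index')` would likely give the same fractions in one call.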

#### Conclusions

**The data supports the first hypothesis**: user activity across days is indeed different in each city.

* In Moscow, our plays spike on Mondays and Fridays, and drop on Wednesdays.
* Conversely, in St. Petersburg, activity peaks on Wednesdays and is lower, at nearly equal levels, on Mondays and Fridays.

### Comparing Monday Morning and Friday Night Genres Across Cities

Our second hypothesis states: Users show different genre preferences on Monday mornings in Moscow and St. Petersburg. They also have different genre preferences on Friday evenings.

Let's break down our data into separate tables for Moscow and St. Petersburg.

In [27]:
moscow_general = df[df['city'] == 'Moscow']
spb_general = df[df['city'] == 'Saint-Petersburg']

Now let's make a function that will return the top 5 genres played on a given day between given times in a given table.

In [28]:
#the function accepts a table, a day, and start and end times
def genre_weekday(table, day, time_start, time_end):
    
    #create a dataframe from table that matches the day and time given
    genre_df = table[(table['day'] == day) & (table['time'] > time_start) & (table['time'] < time_end)]
    
    #group genre_df by genre, count the popularity of each genre, and sort them in a descending order
    genre_df_sorted = genre_df.groupby('genre')['track'].count().sort_values(ascending=False)
    
    #create a list of the top 5 most popular genres
    top_five_list = list(genre_df_sorted.head(5).index)
    
    #return the results as a list
    return top_five_list
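Note that the filter compares times as strings. This appears to work here because the `time` values shown earlier are zero-padded `HH:MM:SS` strings, so alphabetical order matches chronological order; that is an assumption about the whole column, not something checked above. The sketch below shows a more compact variant using `value_counts`; `top_genres` and the toy data are hypothetical.

```python
import pandas as pd


def top_genres(table, day, time_start, time_end, n=5):
    """Hypothetical variant of genre_weekday built on value_counts()."""
    # string comparison is chronological only for zero-padded HH:MM:SS values
    mask = (
        (table['day'] == day)
        & (table['time'] > time_start)
        & (table['time'] < time_end)
    )
    # value_counts() sorts genres by frequency in descending order
    return table.loc[mask, 'genre'].value_counts().head(n).index.tolist()


# toy stand-in for one city's table (hypothetical values)
toy = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Monday', 'Monday', 'Friday'],
    'time': ['08:00:00', '09:30:00', '10:59:59', '12:00:00', '06:59:00', '08:00:00'],
    'genre': ['pop', 'pop', 'rock', 'dance', 'rock', 'pop'],
})

monday_morning = top_genres(toy, 'Monday', '07:00', '11:00', n=2)
print(monday_morning)
```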

Let's apply our function to create a dataframe that shows our top 5 entries across cities for Monday mornings and Friday nights.

In [29]:
#calculate the top 5 genres for each city for each day
moscow_genres = []
spb_genres = []

#use genre_weekday to get our list, but convert it to concatenated strings using join() and map()

#monday morning
moscow_genres.append(", ".join(map(str, genre_weekday(moscow_general, 'Monday', '07:00', '11:00'))))
spb_genres.append(", ".join(map(str, genre_weekday(spb_general, 'Monday', '07:00', '11:00'))))

#friday night
moscow_genres.append(", ".join(map(str, genre_weekday(moscow_general, 'Friday', '17:00', '23:00'))))
spb_genres.append(", ".join(map(str, genre_weekday(spb_general, 'Friday', '17:00', '23:00'))))

In [30]:
#creating our columns for the dataframe
genres_columns_list = []
for i in range(len(moscow_genres)):
    genres_columns_list.append([moscow_genres[i],spb_genres[i]])
    
#creating the names for the columns
display_as_genre_column_names = ['city', 'Monday Morning Top 5 Genres', 'Friday Night Top 5 Genres']

#getting the data for the dataframe
display_as_genre_data = {'city' : ['Moscow', 'Saint-Petersburg'],
                             'Monday Morning Top 5 Genres' : genres_columns_list[0],
                             'Friday Night Top 5 Genres' : genres_columns_list[1]}

#putting the dataframe together and displaying it
display_as_genre_table = pd.DataFrame(display_as_genre_data, columns=display_as_genre_column_names)
display_as_genre_table

Unnamed: 0,city,Monday Morning Top 5 Genres,Friday Night Top 5 Genres
0,Moscow,"pop, dance, electronic, rock, hiphop","pop, rock, dance, electronic, hiphop"
1,Saint-Petersburg,"pop, dance, rock, electronic, hiphop","pop, electronic, rock, dance, hiphop"


Excellent, this makes the top 5 genres clear.

#### Conclusion

**The data does not support the second hypothesis**: the top 5 genres across cities on Monday morning and Friday nights barely change.

Comparing the top 5 genres on Monday morning shows us that users in Moscow and St. Petersburg listen to very similar music. The only change is the order of 3rd and 4th places: `'electronic'` and `'rock'` in Moscow vs. `'rock'` and `'electronic'` in St. Petersburg.

Friday night gives us similar results. The only difference is the order for 2nd, 3rd, and 4th places: `'rock'`, `'dance'`, and `'electronic'` in Moscow vs. `'electronic'`, `'rock'`, and `'dance'` in St. Petersburg.

### Genre Popularity in Moscow and St. Petersburg

Our third hypothesis states: Muscovites prefer pop, while Petersburgians prefer Russian rap.

Let's group `moscow_general` and `spb_general` according to genre, sort them by popularity, and display the top 10 genres.

In [31]:
moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False).head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

In [32]:
spb_general.groupby('genre')['genre'].count().sort_values(ascending=False).head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

#### Conclusions

**The data partially supports the hypothesis**: 

* Pop music is indeed the most popular genre in Moscow. It is also the most popular genre in St. Petersburg. Moreover, Russian pop (`ruspop`) is in the top 10 genres of both cities.
* Russian rap (`rusrap`) is popular in both St. Petersburg and Moscow, though it is more popular in St. Petersburg (8th place in St. Petersburg, 10th place in Moscow).
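Since Moscow has far more plays overall, ranks alone can hide differences in scale. One way to compare the cities on equal footing is to express each genre as a share of that city's total plays. The sketch below does this arithmetic with the counts reported above (city totals from the per-city play counts, genre counts from the two top-10 lists); the variable names are illustrative.

```python
# total plays per city, copied from the per-city counts above
totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}

# plays per genre and city, copied from the two top-10 lists
plays = {
    'pop': {'Moscow': 5892, 'Saint-Petersburg': 2431},
    'rusrap': {'Moscow': 1161, 'Saint-Petersburg': 564},
}

# share of each genre in its city's total plays, in percent
shares = {
    genre: {city: 100 * count / totals[city] for city, count in by_city.items()}
    for genre, by_city in plays.items()
}

for genre, by_city in shares.items():
    for city, share in by_city.items():
        print(f'{genre:>6} in {city}: {share:.1f}% of plays')
```

By this measure, `rusrap` takes a somewhat larger share of plays in St. Petersburg than in Moscow, while `pop` takes a slightly larger share in Moscow, which is in line with the ranks above.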

## Overall Conclusion

We used the data to test three hypotheses.

1. The first hypothesis is **confirmed**. User activity differs between cities. Moscow users are more active on Mondays and Fridays, while St. Petersburg users are more active on Wednesdays.

2. The second hypothesis is **not confirmed**. The top 5 genres on Monday mornings and Friday nights are the same in both cities; only their order within the top 5 differs.

3. The third hypothesis is **partially confirmed**. Pop (`pop`) is the most popular genre in Moscow, but it is also the most popular genre in St. Petersburg. Russian rap (`rusrap`) ranks higher in St. Petersburg than in Moscow (8th vs. 10th place), yet it is not the leading genre there, so overall musical tastes don't vary much between the two cities.

Overall, we were still somewhat limited by the dataset. We were not given any data outside Mondays, Wednesdays, and Fridays, so the overall picture could shift if we had the missing days.