<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Getting-data" data-toc-modified-id="Getting-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Getting data</a></span></li><li><span><a href="#Data-preprocessing" data-toc-modified-id="Data-preprocessing-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data preprocessing</a></span><ul class="toc-item"><li><span><a href="#Headers-style" data-toc-modified-id="Headers-style-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>Headers style</a></span></li><li><span><a href="#Missing-values" data-toc-modified-id="Missing-values-2.2"><span class="toc-item-num">2.2&nbsp;&nbsp;</span>Missing values</a></span></li><li><span><a href="#Duplicates" data-toc-modified-id="Duplicates-2.3"><span class="toc-item-num">2.3&nbsp;&nbsp;</span>Duplicates</a></span></li><li><span><a href="#Data-cleaning-results" data-toc-modified-id="Data-cleaning-results-2.4"><span class="toc-item-num">2.4&nbsp;&nbsp;</span>Data cleaning results</a></span></li></ul></li><li><span><a href="#Hypotheses-testing" data-toc-modified-id="Hypotheses-testing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Hypotheses testing</a></span><ul class="toc-item"><li><span><a href="#Do-people-really-listen-to-music-differently-in-different-cities?" data-toc-modified-id="Do-people-really-listen-to-music-differently-in-different-cities?-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Do people really listen to music differently in different cities?</a></span></li><li><span><a href="#Monday-morning-and-Friday-evening---different-music-or-the-same?" data-toc-modified-id="Monday-morning-and-Friday-evening---different-music-or-the-same?-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Monday morning and Friday evening - different music or the same?</a></span></li><li><span><a href="#Moscow-and-Saint-Petersburg---two-different-capitals-with-different-music-preferences?" 
data-toc-modified-id="Moscow-and-Saint-Petersburg---two-different-capitals-with-different-music-preferences?-3.3"><span class="toc-item-num">3.3&nbsp;&nbsp;</span>Moscow and Saint-Petersburg - two different capitals with different music preferences?</a></span></li></ul></li><li><span><a href="#Results" data-toc-modified-id="Results-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Results</a></span></li></ul></div>

# BIG CITIES MUSIC

## Getting data

The first step is to examine the data provided by the Yandex.Music service for this project. We start by importing the `pandas` library, which we will use for the subsequent analysis.

In [4]:
# importing pandas library
import pandas as pd

With the library imported, we read the file `yandex_music_project.csv` into a DataFrame and store it in the `df` variable.

In [5]:
# reading the dataset into a DataFrame
df = pd.read_csv('yandex_music_project.csv')

We begin the analysis by displaying the first ten rows of `df` to see what the data looks like.

In [6]:
# displaying the first 10 rows of the DataFrame
df.head(10)

Unnamed: 0,userID,Track,artist,genre,City,time,Day
0,FFB692EC,Kamigata To Boots,The Mass Missile,rock,Saint-Petersburg,20:28:33,Wednesday
1,55204538,Delayed Because of Accident,Andreas Rönnberg,rock,Moscow,14:07:09,Friday
2,20EC38,Funiculì funiculà,Mario Lanza,pop,Saint-Petersburg,20:58:07,Wednesday
3,A3DD03C9,Dragons in the Sunset,Fire + Ice,folk,Saint-Petersburg,08:37:09,Monday
4,E2DC1FAE,Soul People,Space Echo,dance,Moscow,08:34:34,Monday
5,842029A1,Преданная,IMPERVTOR,rusrap,Saint-Petersburg,13:09:41,Friday
6,4CB90AA5,True,Roman Messer,dance,Moscow,13:00:07,Wednesday
7,F03E1C1F,Feeling This Way,Polina Griffith,dance,Moscow,20:47:49,Wednesday
8,8FA1D3BE,И вновь продолжается бой,,ruspop,Moscow,09:17:40,Friday
9,E772D5C0,Pessimist,,dance,Saint-Petersburg,21:20:49,Wednesday


We can also dig deeper into the DataFrame by printing a concise summary of it.

In [7]:
# obtaining general information about the data in the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65079 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0     userID  65079 non-null  object
 1   Track     63848 non-null  object
 2   artist    57876 non-null  object
 3   genre     63881 non-null  object
 4     City    65079 non-null  object
 5   time      65079 non-null  object
 6   Day       65079 non-null  object
dtypes: object(7)
memory usage: 3.5+ MB


Now, let's examine the obtained information in more detail:

1. The DataFrame has seven columns, each of which has the `object` dtype.


2. The columns of `df` contain the following information:

* `userID` - user identification number;
* `Track` - name of a track;
* `artist` - name of an artist;
* `genre` - name of a genre;
* `City` - city where a song was played;
* `time` - time when a user started listening to a song;
* `Day` - day of the week.


3. The non-null counts differ between columns, which means that the data contains missing values.

**Conclusions**

Each row of the DataFrame describes a single track play: the track, its artist and genre, the city where it was played, and the time and day of the week when the user started listening.

Three problems need to be addressed: inconsistent column names, missing values and possible duplicates. The `time`, `Day` and `City` columns will be especially useful for testing the working hypotheses. The `genre` column will make it possible to identify the most popular genres.

## Data preprocessing

In this section, we will rename the columns, handle missing values and check the data for duplicates.

### Headers style

We start by displaying the names of the columns in the DataFrame.

In [8]:
# displaying the names of columns of the current DataFrame
df.columns

Index(['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'], dtype='object')

The column names have several problems:

* Lowercase and uppercase letters are mixed;
* Some names contain leading or trailing spaces;
* Multi-word names are not written in snake_case (for example, `userID` should be `user_id`).

To simplify the subsequent analysis, we rename the columns in a consistent style and verify the result.

In [9]:
# changing the names of columns of the current DataFrame
df = df.rename(columns={'  userID':'user_id', 'Track':'track', '  City  ':'city', 'Day':'day'})

In [10]:
# verifying the changes made
df.columns

Index(['user_id', 'track', 'artist', 'genre', 'city', 'time', 'day'], dtype='object')
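As an alternative to listing every rename by hand, the same cleanup could be sketched with a small helper. The function name `normalize_header` and the empty `sample` frame are illustrative assumptions and are not part of the original notebook.

```python
import re
import pandas as pd

def normalize_header(name):
    """Hypothetical helper: strip spaces and convert camelCase to snake_case."""
    name = name.strip()
    # insert an underscore where a lowercase letter or digit is followed by an uppercase letter
    name = re.sub(r'(?<=[a-z0-9])(?=[A-Z])', '_', name)
    return name.lower()

# empty frame that only reproduces the raw headers shown above
sample = pd.DataFrame(columns=['  userID', 'Track', 'artist', 'genre', '  City  ', 'time', 'Day'])
sample.columns = [normalize_header(column) for column in sample.columns]
print(list(sample.columns))
```

For the seven headers in this dataset the helper yields the same names as the explicit `rename()` call, but it would need review before being applied to headers with other naming patterns.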

### Missing values

Let's compute the number of missing values in the DataFrame.

In [11]:
# calculating the number of missing values in the DataFrame
df.isna().sum()

user_id       0
track      1231
artist     7203
genre      1198
city          0
time          0
day           0
dtype: int64
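To put these counts in perspective, they can be expressed as a share of all rows. The sketch below only reuses the numbers reported above (the counts from `isna().sum()` and the 65079 rows from `df.info()`); the variable names are illustrative. On the DataFrame itself, `df.isna().mean()` would give the same fractions directly.

```python
import pandas as pd

# counts copied from the isna().sum() output above
missing = pd.Series({'track': 1231, 'artist': 7203, 'genre': 1198})
# total number of rows reported by df.info()
total_rows = 65079

# share of missing values per column, in percent
missing_share = (missing / total_rows * 100).round(2)
print(missing_share)
```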

Missing values mean that some information is unavailable for certain tracks. The reasons can vary: the artist of a song might simply not have been specified or, in a worse case, there may have been problems with data recording itself.

Missing values in `track` and `artist` are not critical for the subsequent analysis, so we can simply replace them with an explicit placeholder. Missing values in the `genre` column, however, can complicate the comparison of musical tastes in Moscow and Saint-Petersburg, because this column is crucial for the analysis. Since these values cannot be restored here, we replace them with the same placeholder and later evaluate how much they affect the results.

Accordingly, we replace the missing values in the `track`, `artist` and `genre` columns with the string `'unknown'`.

In [12]:
# replacing missing values in columns 'track', 'artist' and 'genre' with 'unknown'
columns_to_replace = ['track', 'artist', 'genre']
for column in columns_to_replace:
    df[column] = df[column].fillna('unknown')

Before going further, verify whether there are any missing values left in the table.

In [13]:
# verifying the changes made
df.isna().sum()

user_id    0
track      0
artist     0
genre      0
city       0
time       0
day        0
dtype: int64

### Duplicates

First, we count the explicit (exact) duplicates.

In [14]:
# computing the number of exact duplicates
df.duplicated().sum()

3826

We can easily get rid of such duplicates.

In [15]:
# removing duplicated rows
df = df.drop_duplicates()

Compute the number of explicit duplicates again for verification purposes.

In [16]:
# verifying changes made
df.duplicated().sum()

0
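One side effect worth knowing: `drop_duplicates()` keeps the original index labels, so the index has gaps after the removal. The sketch below uses a hypothetical toy frame to show how `reset_index(drop=True)` could renumber the rows; the notebook itself does not take this step, and none of the later calculations depend on it.

```python
import pandas as pd

# toy data with two pairs of exact duplicates (illustrative, not from the dataset)
toy = pd.DataFrame({'user_id': ['A1', 'A1', 'B2', 'B2', 'C3'],
                    'track': ['x', 'x', 'y', 'y', 'z']})

deduped = toy.drop_duplicates()
print(list(deduped.index))      # original labels are kept, so gaps remain

renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))   # consecutive labels starting from 0
```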

Next, we need to deal with implicit duplicates. These occur, for instance, when the name of the same genre is written in different ways. Such inconsistencies can distort the results of the analysis.

Let's start by displaying the unique genre names in alphabetical order.

In [17]:
# printing the unique genre names in alphabetical order
print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 

Skimming through the genre names reveals the following implicit duplicates of `hiphop`:

* `hip`
* `hop`
* `hip-hop`

We remove them by replacing these variants with the single common name `hiphop`:

In [18]:
# replacing duplicates with one single name 'hiphop'
df['genre'] = df['genre'].replace(['hip', 'hop', 'hip-hop'], 'hiphop')

We then verify that the implicit duplicates have been removed.

In [19]:
# verifying changes made
print(df['genre'].sort_values().unique())

['acid' 'acoustic' 'action' 'adult' 'africa' 'afrikaans' 'alternative'
 'alternativepunk' 'ambient' 'americana' 'animated' 'anime' 'arabesk'
 'arabic' 'arena' 'argentinetango' 'art' 'audiobook' 'author' 'avantgarde'
 'axé' 'baile' 'balkan' 'beats' 'bigroom' 'black' 'bluegrass' 'blues'
 'bollywood' 'bossa' 'brazilian' 'breakbeat' 'breaks' 'broadway'
 'cantautori' 'cantopop' 'canzone' 'caribbean' 'caucasian' 'celtic'
 'chamber' 'chanson' 'children' 'chill' 'chinese' 'choral' 'christian'
 'christmas' 'classical' 'classicmetal' 'club' 'colombian' 'comedy'
 'conjazz' 'contemporary' 'country' 'cuban' 'dance' 'dancehall' 'dancepop'
 'dark' 'death' 'deep' 'deutschrock' 'deutschspr' 'dirty' 'disco' 'dnb'
 'documentary' 'downbeat' 'downtempo' 'drum' 'dub' 'dubstep' 'eastern'
 'easy' 'electronic' 'electropop' 'emo' 'entehno' 'epicmetal' 'estrada'
 'ethnic' 'eurofolk' 'european' 'experimental' 'extrememetal' 'fado'
 'fairytail' 'film' 'fitness' 'flamenco' 'folk' 'folklore' 'folkmetal'
 'folkrock' 
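Because the list of unique genres is long, scanning it by eye is error-prone. A programmatic check is sketched below on a hypothetical toy Series (the names `genres`, `wrong_names`, `cleaned` and `remaining` are assumptions): after the replacement, `isin()` should find none of the variant spellings.

```python
import pandas as pd

# toy genre column containing the three variant spellings (illustrative data)
genres = pd.Series(['hip', 'hop', 'hip-hop', 'hiphop', 'rock'])
wrong_names = ['hip', 'hop', 'hip-hop']

# same replacement as in the notebook, applied to the toy Series
cleaned = genres.replace(wrong_names, 'hiphop')

# number of rows that still carry one of the variant spellings
remaining = int(cleaned.isin(wrong_names).sum())
print(remaining)
```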

### Data cleaning results

Lastly, let's print the concise summary of the DataFrame again to make sure that the data cleaning was done correctly.

In [20]:
# displaying general information about the DataFrame
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 61253 entries, 0 to 65078
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype 
---  ------   --------------  ----- 
 0   user_id  61253 non-null  object
 1   track    61253 non-null  object
 2   artist   61253 non-null  object
 3   genre    61253 non-null  object
 4   city     61253 non-null  object
 5   time     61253 non-null  object
 6   day      61253 non-null  object
dtypes: object(7)
memory usage: 3.7+ MB


**Conclusions**

Data preprocessing identified three problems in the data:

* Inadequate headers style;
* Missing values;
* Duplicates - explicit and implicit.

Fixing the headers simplifies working with the DataFrame, and removing duplicates makes the results of the analysis more accurate. The missing values in the `genre` column have been replaced with `'unknown'`, and it remains to be seen whether they affect the results.

Now, we can get to testing the main hypotheses.

## Hypotheses testing

### Do people really listen to music differently in different cities?

The first hypothesis states that users listen to music differently in Moscow and Saint-Petersburg. We test it using data for three days of the week: Monday, Wednesday and Friday. To do so, we need to:

* Separate the users of Moscow and Saint-Petersburg;
* Compare the number of tracks played by each user group on Monday, Wednesday and Friday.

To better understand the behavior of each group of users, let's first estimate user activity in each city:

In [21]:
# grouping the DataFrame by city and computing the number of track plays
df.groupby('city')['user_id'].count()

city
Moscow              42741
Saint-Petersburg    18512
Name: user_id, dtype: int64

There are more track plays in Moscow than in Saint-Petersburg. However, this does not necessarily mean that Moscow users listen to music more often: a likely explanation is that there are simply more users in Moscow.

Now, let's group the data by day of the week and count the track plays on Monday, Wednesday and Friday (the data covers only these three days):

In [22]:
# grouping the DataFrame by day and computing the number of track plays
df.groupby('day')['user_id'].count()

day
Friday       21840
Monday       21354
Wednesday    18059
Name: user_id, dtype: int64

Across both cities combined, users are less active on Wednesday than on Monday or Friday, but the picture may change once we examine each city separately.

Now, let's create a function that combines the two groupings above, so that we can see how often users in each city listen to music on each day.

The function `number_tracks()` will take two parameters:

* `day` - day of the week;
* `city` - name of the city.

It will return the number of tracks played by users in a particular city and on a particular day.

In [23]:
# Function for counting the number of track plays for a particular day and city.
# Parameters: <day> (day of the week), <city> (name of the city).
# Using sequential filtering with logical indexing, it first selects the rows
# of the df DataFrame with the needed day, then keeps only the rows with the needed city,
# and uses count() to calculate the number of values in the 'user_id' column.
# The function returns this number (track_list_count).
def number_tracks(day, city):
    track_list = df[df['day'] == day]
    track_list = track_list[track_list['city'] == city]
    track_list_count = track_list['user_id'].count()
    return track_list_count

Now, we can sequentially call the function `number_tracks()` in order to obtain the information needed:

In [24]:
# storing the names of cities in a list
cities = ['Moscow', 'Saint-Petersburg']
# storing the names of weekdays in a list
days = ['Monday', 'Wednesday', 'Friday']
# building the nested list of track plays: one inner list per city, one value per day
num_track_plays = [[number_tracks(day, city) for day in days] for city in cities]

The first inner list of the resulting nested list contains the number of track plays in Moscow on each day, and the second contains the same numbers for Saint-Petersburg.

Now, we can create a DataFrame to present the results conveniently:

In [25]:
# creating the headers of a new DataFrame
columns = ['City']
columns.extend(days)

# retrieving track plays from the nested list
moscow_track_plays = num_track_plays[0]
spb_track_plays = num_track_plays[1]

# defining the first row with data in a DataFrame
row_1 = [cities[0]]
row_1.extend(moscow_track_plays)

# defining the second row with data in a DataFrame
row_2 = [cities[1]]
row_2.extend(spb_track_plays)

# creating a DataFrame with results
data = [row_1, row_2]
info = pd.DataFrame(data=data, columns=columns)

# displaying a DataFrame
info

Unnamed: 0,City,Monday,Wednesday,Friday
0,Moscow,15740,11056,15945
1,Saint-Petersburg,5614,7003,5895
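The same city-by-day table could also be obtained in a single call with `pd.crosstab`, which counts the rows for every combination of two columns. The sketch below runs on a hypothetical toy frame (`plays`) rather than on the real data, so the numbers are illustrative only.

```python
import pandas as pd

# toy track plays (illustrative data, not taken from the dataset)
plays = pd.DataFrame({
    'city': ['Moscow', 'Moscow', 'Moscow', 'Saint-Petersburg', 'Saint-Petersburg'],
    'day': ['Monday', 'Friday', 'Monday', 'Wednesday', 'Monday'],
})

# count rows for each (city, day) pair
table = pd.crosstab(plays['city'], plays['day'])
# put the days in calendar order instead of alphabetical order
table = table.reindex(columns=['Monday', 'Wednesday', 'Friday'], fill_value=0)
print(table)
```

Compared with calling `number_tracks()` for every pair, this avoids the nested list and the manual assembly of rows, at the cost of hiding the filtering logic that the function makes explicit.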


**Conclusions**

The data shows a difference in user behavior:

* In Moscow, the number of track plays peaks on Monday and Friday and drops on Wednesday;
* In Saint-Petersburg, conversely, users listen to music most on Wednesday, while activity on Monday and Friday is lower.

Thus, the data confirms the first hypothesis.
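Because Moscow has far more track plays overall, the two cities are easier to compare when each day is expressed as a share of that city's total. The sketch below reuses the counts from the results table; the names `counts` and `shares` are illustrative.

```python
import pandas as pd

# track-play counts copied from the results table above
counts = pd.DataFrame(
    {'Monday': [15740, 5614], 'Wednesday': [11056, 7003], 'Friday': [15945, 5895]},
    index=['Moscow', 'Saint-Petersburg'],
)

# divide every row by its own total so that each city's shares sum to 1
shares = counts.div(counts.sum(axis=1), axis=0)
print((shares * 100).round(1))
```

In this normalized view, Wednesday has the smallest share in Moscow and the largest share in Saint-Petersburg, which is the same pattern as in the raw counts.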

### Monday morning and Friday evening - different music or the same?

According to the second hypothesis, different genres prevail in Moscow and Saint-Petersburg on Monday morning. Likewise, the prevailing genres on Friday evening differ depending on the city.

We create two variables and store a DataFrame in each, one per city:

In [26]:
# taking a part of the DataFrame only for users from Moscow
moscow_general = df[df['city'] == 'Moscow']

In [27]:
# taking a part of the DataFrame only for users from Saint-Petersburg
spb_general = df[df['city'] == 'Saint-Petersburg']

Now, we define a function `genre_weekday()` with four parameters:

* `df` - DataFrame with data;
* `day` - day of the week;
* `time1` - start of the time interval in 'hh:mm' format;
* `time2` - end of the time interval in 'hh:mm' format.

The function returns the top-10 genres of tracks played on a particular day between the two time stamps.

In [28]:
# Defining genre_weekday() function with parameters: <df>, <day>, <time1>, <time2>,
# which returns information about the most popular genres on a given day at a given time:
# 1) genre_df stores rows of df DataFrame, for which:
#    - value in 'day' column is equal to <day>
#    - value in 'time' column is more than <time1>
#    - value in 'time' column is less than <time2>
#    Then, use sequential filtering with logical indexation to obtain genre_df DataFrame.
# 2) group genre_df by 'genre' column, take one of its columns and
#    use count() to calculate the number of entries for each of the present genres,
#    assign the resulting Series to genre_df_grouped
# 3) sort genre_df_grouped in the descending order and save it in genre_df_sorted
# 4) return a Series with the first ten values of genre_df_sorted, these will be top-10
#    popular genres (on a given day, at a given time)
def genre_weekday(df, day, time1, time2):
    # sequential filtering
    # genre_df stores only rows where values of 'day' column are equal to day
    genre_df = df[df['day'] == day]
    # genre_df stores only rows of genre_df for which values of 'time' column are less than time2
    genre_df = genre_df[genre_df['time'] < time2]
    # genre_df stores only rows of genre_df for which values of 'time' column are more than time1
    genre_df = genre_df[genre_df['time'] > time1]
    # group the filtered DataFrame by 'genre' column with names of genres and compute the number of rows for each genre
    genre_df_grouped = genre_df.groupby('genre')['genre'].count()
    # sort the result in the descending order
    genre_df_sorted = genre_df_grouped.sort_values(ascending=False)
    # return top-10 genres on a certain day during the time period specified
    return genre_df_sorted[:10]
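The time filter in `genre_weekday()` relies on the `time` column holding zero-padded 'hh:mm:ss' strings, so a plain string comparison orders them the same way as the clock does. A minimal sketch on hypothetical values (the Series `times` is illustrative) shows which rows pass a '07:00' to '11:00' filter:

```python
import pandas as pd

# illustrative time stamps in the same 'hh:mm:ss' string format as the 'time' column
times = pd.Series(['06:59:59', '07:00:00', '08:34:34', '11:00:00', '20:28:33'])

# strings are compared character by character, which matches chronological
# order only because every component is zero-padded to two digits
mask = (times > '07:00') & (times < '11:00')
selected = times[mask].tolist()
print(selected)
```

Note that '07:00:00' passes the lower bound because a longer string with the same prefix compares as greater, while '11:00:00' fails the upper bound for the same reason. If the column ever contained unpadded values such as '7:05:00', this approach would no longer be reliable and converting to a time type would be safer.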

In [29]:
# getting the top-10 genres in Moscow on Monday morning
genre_weekday(moscow_general, 'Monday', '07:00', '11:00')

genre
pop            781
dance          549
electronic     480
rock           474
hiphop         286
ruspop         186
world          181
rusrap         175
alternative    164
unknown        161
Name: genre, dtype: int64

In [30]:
# getting the top-10 genres in Saint-Petersburg on Monday morning
genre_weekday(spb_general, 'Monday','07:00','11:00')

genre
pop            218
dance          182
rock           162
electronic     147
hiphop          80
ruspop          64
alternative     58
rusrap          55
jazz            44
classical       40
Name: genre, dtype: int64

In [31]:
# getting the top-10 genres in Moscow on Friday evening
genre_weekday(moscow_general, 'Friday', '17:00', '23:00')

genre
pop            713
rock           517
dance          495
electronic     482
hiphop         273
world          208
ruspop         170
alternative    163
classical      163
rusrap         142
Name: genre, dtype: int64

In [32]:
# getting the top-10 genres in Saint-Petersburg on Friday evening
genre_weekday(spb_general, 'Friday', '17:00', '23:00')

genre
pop            256
electronic     216
rock           216
dance          210
hiphop          97
alternative     63
jazz            61
classical       60
rusrap          59
world           54
Name: genre, dtype: int64

**Conclusions**

By comparing the top-10 genres on Monday morning, the following conclusions can be drawn:

* Users in Moscow and Saint-Petersburg listen to similar music. The only difference in genres is that the Moscow rating includes "world", while the Saint-Petersburg rating includes jazz and classical music.
* In Moscow there are so many missing values that the `unknown` genre took tenth place in the rating of the most popular genres. Hence, missing values account for a noticeable share of the data and are likely to affect the results.

Friday evening does not change this picture: some genres move up or down within the top-10, but overall the rating stays the same.

Thus, the second hypothesis has been only partially confirmed:

* Users listen to similar music at the beginning and at the end of the week.
* There is no distinct difference between Moscow and Saint-Petersburg. Users in Moscow often listen to Russian pop music, while users in Saint-Petersburg often listen to jazz.

However, the missing values do not allow us to confirm these results with confidence. Moscow has so many of them that the top-10 rating could have looked different had we had complete information.
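The comparison of the two Monday-morning ratings can be made explicit with set operations. The lists below are copied from the outputs above; the variable names are illustrative.

```python
# top-10 genres on Monday morning, copied from the outputs above
moscow_monday = ['pop', 'dance', 'electronic', 'rock', 'hiphop',
                 'ruspop', 'world', 'rusrap', 'alternative', 'unknown']
spb_monday = ['pop', 'dance', 'rock', 'electronic', 'hiphop',
              'ruspop', 'alternative', 'rusrap', 'jazz', 'classical']

# genres that appear in one rating but not in the other
only_moscow = sorted(set(moscow_monday) - set(spb_monday))
only_spb = sorted(set(spb_monday) - set(moscow_monday))
# genres shared by both ratings
shared = sorted(set(moscow_monday) & set(spb_monday))

print(only_moscow)
print(only_spb)
print(len(shared))
```

Eight of the ten genres are shared. The Moscow-only entries are "world" and the `unknown` placeholder, and the Saint-Petersburg-only entries are jazz and classical music.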

### Moscow and Saint-Petersburg - two different capitals with different music preferences?

The third hypothesis: rap prevails in Saint-Petersburg, while pop prevails in Moscow.

We start by grouping the data by genre and counting the number of track plays in Moscow.

In [33]:
# grouping and sorting the DataFrame by genre and computing the number of tracks played in Moscow
moscow_genres = moscow_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [34]:
# displaying the first ten rows of the Series
moscow_genres.head(10)

genre
pop            5892
dance          4435
rock           3965
electronic     3786
hiphop         2096
classical      1616
world          1432
alternative    1379
ruspop         1372
rusrap         1161
Name: genre, dtype: int64

We can do the same for Saint-Petersburg.

In [35]:
# grouping and sorting the DataFrame by genre and computing the number of tracks played in Saint-Petersburg
spb_genres = spb_general.groupby('genre')['genre'].count().sort_values(ascending=False)

In [36]:
# displaying the first ten rows of the Series
spb_genres.head(10)

genre
pop            2431
dance          1932
rock           1879
electronic     1736
hiphop          960
alternative     649
classical       646
rusrap          564
ruspop          538
world           515
Name: genre, dtype: int64

**Conclusions**

The hypothesis has been partially confirmed:

* Pop music is the most popular genre in Moscow, as the hypothesis stated. In addition, the top-10 includes a closely related genre, Russian pop music.
* Contrary to expectations, rap is about as popular in Moscow as it is in Saint-Petersburg.
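Raw counts are hard to compare between cities of different size, so it may help to express each genre as a share of all track plays in its city. The sketch below reuses the per-city totals (42741 and 18512) and the genre counts from the outputs above; the dictionary names are illustrative. On the real data, `value_counts(normalize=True)` on the `genre` column of each city's DataFrame would produce the same kind of shares.

```python
# total track plays per city, copied from the groupby('city') output above
city_totals = {'Moscow': 42741, 'Saint-Petersburg': 18512}

# genre counts copied from the two top-10 outputs above
genre_counts = {
    'Moscow': {'pop': 5892, 'hiphop': 2096, 'rusrap': 1161},
    'Saint-Petersburg': {'pop': 2431, 'hiphop': 960, 'rusrap': 564},
}

# share of each genre in all track plays of its city
genre_shares = {
    city: {genre: count / city_totals[city] for genre, count in counts.items()}
    for city, counts in genre_counts.items()
}

for city, shares in genre_shares.items():
    print(city, {genre: f'{share:.1%}' for genre, share in shares.items()})
```

The shares of `hiphop` and `rusrap` differ between the cities by well under one percentage point, which is consistent with the conclusion that rap is about equally popular in both.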

## Results

Working hypotheses:

* Users in Moscow and Saint-Petersburg listen to music differently depending on the day of the week;
* The top-10 ratings of popular genres on Monday morning and Friday evening differ distinctly between the two cities;
* Users in the two cities prefer different music genres.

**General results**

Moscow and Saint-Petersburg are similar in musical tastes: pop music is preferred everywhere. Moreover, preferences do not depend on the day of the week: people listen to what they like regardless of the day. User activity, however, does differ by day: Moscow users listen to music more on Monday and Friday than on Wednesday, while in Saint-Petersburg the reverse is true and users play tracks more frequently on Wednesday than on Monday and Friday.

Thus, the first hypothesis has been confirmed, and the other two have been partially confirmed.