# Python Recap

## *Workshop 1*  [![Open In Colab](https://github.com/oballinger/QM2/blob/main/colab-badge.png?raw=1)](https://colab.research.google.com/github/oballinger/QM2/blob/main/notebooks/W01.%20Python%20Recap.ipynb)

## Registering a GitHub account

Before we get started, we need to set a few things up. GitHub is a platform for software development and version control using Git, allowing developers to store and manage their code. Think of it as Google Docs for code: it will be very useful for collaborating on your group projects later in the term, and in your future as a data analyst.

1. Use [this link](https://github.com/join) to register for a GitHub account if you don't already have one.
2. Once that's done, [create a new GitHub repository](https://github.com/new) called "QM2".
3. In this notebook, click "File" and then "Save a copy in GitHub".

Voila! You now have a version of this notebook saved to your own GitHub account. *You will need to do step 3 for all the workshops!* Now, on to Python.

## Using Python

In this course, we'll make extensive use of *Python*, a programming language used widely in scientific computing and on the web. We will be using Python as a way to manipulate, plot and analyse data. This isn't a course about learning Python; it's about working with data. But we'll be learning a little bit of programming along the way.

By now, you should have done the prerequisites for the module, and understand a bit about how Python is structured, what different commands do, and so on - this is a bit of a refresher to remind you of what we need at the beginning of term.

The particular flavour of Python we're using is *IPython*, which, as we've seen, allows us to combine text, code, images, equations and figures in a *Notebook*. This is a *cell*, written in *markdown* - a way of writing nice text. Contrast this with a *code* cell, which executes a bit of Python:

In [None]:
print(2+2)

4


The Notebook format allows you to engage in what Don Knuth describes as [Literate Programming](http://en.wikipedia.org/wiki/Literate_programming):

> […] Instead of writing code containing documentation, the literate programmer writes documentation containing code. No longer does the English commentary injected into a program have to be hidden in comment delimiters at the top of the file, or under procedure headings, or at the end of lines. Instead, it is wrenched into the daylight and made the main focus. The "program" then becomes primarily a document directed at humans, with the code being herded between "code delimiters" from where it can be extracted and shuffled out sideways to the language system by literate programming tools.
[Ross Williams][1]

[1]: http://www.literateprogramming.com/lpquotes.html

Libraries
---------

We will work with a number of *libraries*, which provide additional functions and techniques to help us to carry out our tasks.

These include:

*Pandas*: we'll use this a lot to slice and dice data

*matplotlib*: this is our basic graphing software, and we'll also use it for mapping

*nltk*: the Natural Language Toolkit will help us work with text

We aren't doing all this to learn to program. We could spend a whole term learning how to use Python and never look at any data, maps, graphs, or visualisations. But we do need to understand a few basics to use Python for working with data. So let's revisit a few concepts that you should have covered in your prerequisites.

Variables
---------

Python can broadly be divided into verbs and nouns: things which *do* things, and things which *are* things. In Python, the verbs can be *commands*, *functions*, or *methods*. We won't worry too much about the distinction here - suffice it to say, they are the parts of code which manipulate data, calculate values, or show things on the screen.

The simplest noun in Python is the *variable*. Variables are given names and store information. This can be, for example, numeric, text, or boolean (true/false). These are all statements setting up variables:

n = 1

t = "hi"

b = True

Now let's try this in code:

In [None]:
n = 1

t = "hi"

b = True

Note that each command is on a new line; other than that, the *syntax* of Python should be fairly clear. We're setting each variable equal to a value: a number, a piece of text, or a boolean. **What's a boolean?**

The benefit is that we now have values tied to these variables - so every time we want to use a value, we can refer to its variable:

In [None]:
n

1

In [None]:
t

'hi'

In [None]:
b

True

Because we've defined these variables in the early part of the notebook, we can use them later on.
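
If you're ever unsure what kind of value a variable holds, Python's built-in `type()` function will tell you. The sketch below redefines the three variables from above so that it runs on its own:

```python
# Redefine the variables from above so this cell is self-contained
n = 1
t = "hi"
b = True

# type() reports what kind of value each variable holds
print(type(n).__name__)  # int
print(type(t).__name__)  # str
print(type(b).__name__)  # bool
```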

***Advanced**: where do **classes** fit into this noun/verb picture of variables and commands?*

Where is my data?
-----------------

When we work in Excel and text editors, we're used to seeing the data onscreen - and if we manipulate the data in some way (averaging or summing up), we see both the inputs and outputs on screen. The big difference in working with Python is that we don't see our variables all of the time, or the effect we're having on them. They're there in the background, but it's usually worth checking in on them from time to time, to see whether our processes are doing what we think they're doing.

This is pretty easy to do - we can just type the variable name, or "print(*variable name*)":

In [None]:
n = n+1
print(n)
print(t)
print(b)

2
hi
True


Flow
----

Python, in common with all programming languages, executes commands in a sequence - we might refer to this as the "ineluctable march of the machines", but it's more commonly referred to as the *flow* of the code (we'll use the word "code" a lot - it just means commands written in the programming language). In most cases, code just executes in the order it's written. This is true within each *cell* (each block of text in the notebook), and it's true when we execute the cells in order; that's why we can refer back to the variables we defined earlier:

In [None]:
print(n)

2


If we make a change to one of these variables, say n:

In [None]:
n = 3

and then re-run the `print(n)` cell above, you'll see that n is now 3. Running cells out of order like this makes the flow of the code hard to follow. For this reason, try to write your code so it executes in order, one cell at a time. At least for the moment, this will make it easier to follow the logic of what you're doing to data.

*Advanced*: what happens to this flow when you write *functions* to automate common tasks?
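
As a pointer for the question above: defining a function stores a piece of code without running it, and that code only executes when the function is called. A minimal sketch, where the names `add_one` and `m` are made up for illustration:

```python
def add_one(x):
    # This body does not run when the function is defined,
    # only when the function is called further down
    return x + 1

m = 3
result = add_one(m)  # the flow jumps into the function here, then comes back
print(result)        # 4
print(m)             # 3, m itself is unchanged
```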

***Exercise - Setting up variables***:


1. Create a new cell.

2. Create the variable "name" and assign your name to it.

3. Create a variable "Python" and assign a score out of 10 to how much you like Python.

4. Create a variable "prior" and if you've used Python before, assign True; otherwise assign False to the variable

5. Print these out to the screen

In [None]:
name = 'yoko'
Python = 2
prior = False
print(name)
print(Python)
print(prior)


yoko
2
False


Downloading Data
--------------------------

Let's fetch the data we will be using for this session. There are two ways to get data into the Colab notebook. First, you can use the following code to upload a CSV or similar data file from your computer.


In [None]:
from google.colab import files
uploaded = files.upload()

Or you can use the following cell to fetch the data directly from the QM2 server.

Let's create a folder in which we can store all our data for this session:

In [None]:
!mkdir data

In [None]:
!mkdir ./data/wk1
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/data.csv -o ./data/wk1/data.csv
!curl https://s3.eu-west-2.amazonaws.com/qm2/wk1/sample_group.csv -o ./data/wk1/sample_group.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   203  100   203    0     0   2872      0 --:--:-- --:--:-- --:--:--  3029
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   297  100   297    0     0   1844      0 --:--:-- --:--:-- --:--:--  1879
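
Re-running `!mkdir data` prints an error if the folder already exists. One way around that, sketched here with Python's standard `pathlib` module instead of a shell command, is to create the folders with `exist_ok=True`:

```python
from pathlib import Path

# Create ./data/wk1 (and ./data if needed); no error if they already exist
data_dir = Path("data") / "wk1"
data_dir.mkdir(parents=True, exist_ok=True)

print(data_dir.is_dir())  # True
```

This cell can be run as many times as you like without complaint.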


Storing and importing data
--------------------------

Typically, data we look at won't be just one number, or one bit of text. Python has a lot of different ways of dealing with a bunch of numbers: for example, a list of values is called a **list**:

In [None]:
listy = [1,2,3,6,9]
print(listy)

[1, 2, 3, 6, 9]


A set of values *linked* to an index (or key) is called a **dictionary**; for example:

In [None]:
dicty = {'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}
print(dicty)

{'Bob': 1.2, 'Mike': 1.2, 'Coop': 1.1, 'Maddy': 1.3, 'Giant': 2.1}


Notice that the list uses square brackets with values separated by commas, and the dict uses curly brackets with pairs separated by commas, and colons (:) to link a *key* (index or address) with a value.

(In Python 3.7 and later, a dictionary prints its items in the order you entered them. Older versions of Python did not guarantee this.)
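
Lists are looked up by position (counting from zero) and dictionaries by key. A short sketch using made-up values, so it doesn't give away the exercise that follows:

```python
# A list is ordered: items are looked up by position, starting at 0
heights = [6.0, 5.5, 7.5]
print(heights[0])    # 6.0
heights.append(5.0)  # add a value to the end
print(len(heights))  # 4

# A dictionary links keys to values: items are looked up by key
ages = {"Ann": 31, "Raj": 27}
print(ages["Raj"])   # 27
ages["Lee"] = 45     # add a new key-value pair
print(list(ages))    # ['Ann', 'Raj', 'Lee']
```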

***Advanced**: Print out 1) the third element of **listy**, and 2) the element of **dicty** relating to Giant.*

In [None]:
print(listy[2])
print(dicty["Giant"])

3
2.1


We'll discuss different ways of organising data again soon, but for now we'll look at *dataframes* - the way our data-friendly *library* **Pandas** works with data. We'll be using Pandas a lot this term, so it's good to get started with it early.

Let's start by importing pandas. We'll also import another library, but we're not going to worry about that too much at the moment.  

If you see a warning about 'Building Font Cache' don't worry - this is normal.

In [None]:
import pandas

import matplotlib
%matplotlib inline

Let's import a simple dataset and show it in pandas. We'll use the pre-prepared ".csv" file that we downloaded into the `./data/wk1/` folder earlier.

In [None]:
data = pandas.read_csv('./data/wk1/data.csv')
data.head()

Unnamed: 0,Name,First Appearance,Approx height,Gender,Law Enforcement
0,Bob,1.2,6.0,Male,False
1,Mike,1.2,5.5,Male,False
2,Coop,1.1,6.0,Male,True
3,Maddy,1.3,5.5,Female,False
4,Giant,2.1,7.5,Male,False


What we've done here is read a .csv file into a dataframe, the object pandas uses to work with data, and one that has lots of methods for slicing and dicing data, as we will see over the coming weeks. The `head()` method returns the first few rows of the data (five by default), so we can start to get a sense of what the data looks like and what sort of objects it represents.

A common first step for exploring our data is to sort it. In Pandas, this can be done easily with the `sort_values()` method. We can specify which column to sort the data by, and whether we want to sort in ascending or descending order, using the optional arguments `by` and `ascending`, respectively. In the example below, we're sorting in *descending* order of height:
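
To see what `read_csv` and `head` are doing without needing a file on disk, here is a small sketch that reads CSV text from memory. The three rows are copied from the table above; `io.StringIO` simply makes a string behave like a file.

```python
import io
import pandas as pd

# CSV text standing in for a file on disk
csv_text = """Name,Approx height,Gender
Bob,6.0,Male
Mike,5.5,Male
Maddy,5.5,Female
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)    # (number of rows, number of columns)
print(df.head(2))  # first two rows only
print(df.dtypes)   # the type pandas inferred for each column
```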

In [None]:
data.sort_values(by='Approx height', ascending=False).head()

Unnamed: 0,Name,First Appearance,Approx height,Gender,Law Enforcement
4,Giant,2.1,7.5,Male,False
0,Bob,1.2,6.0,Male,False
2,Coop,1.1,6.0,Male,True
1,Mike,1.2,5.5,Male,False
3,Maddy,1.3,5.5,Female,False
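
`by` also accepts a list of columns, which is handy for breaking ties. A sketch on a small hand-typed copy of the table (so it runs on its own): sort by height, tallest first, then alphabetically by name within each height.

```python
import pandas as pd

people = pd.DataFrame({
    "Name": ["Bob", "Mike", "Coop", "Maddy", "Giant"],
    "Approx height": [6.0, 5.5, 6.0, 5.5, 7.5],
})

# Height descending, then Name ascending to break ties
ranked = people.sort_values(by=["Approx height", "Name"], ascending=[False, True])
print(ranked["Name"].tolist())  # ['Giant', 'Bob', 'Coop', 'Maddy', 'Mike']
```

Note that `sort_values` returns a new dataframe; `people` itself keeps its original order.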


# Supplementary: Kaggle exercises

If you've gotten this far, congratulations! To further hone your skills, try working your way through the five [intro to programming notebooks on Kaggle](https://www.kaggle.com/learn/intro-to-programming). These cover a range of skills that we'll be using throughout the term. Kaggle is a very useful resource for learning data science, so making an account may not be a bad idea!

# Assessed Question

The URL below contains a dataset of the most streamed songs on Spotify in 2023:
https://storage.googleapis.com/qm2/wk1/spotify-2023.csv

1. Download the dataset and save it in the `./data/wk1/` directory.
2. Load the dataset as a pandas dataframe, and inspect it. Two of the column names have accidentally been swapped around. Use common sense to figure out which ones these are before proceeding with your analysis.
3. Filter the dataset to only contain songs in the key of C sharp.
4. Sort the dataframe in descending order of streams.

QUESTION: which artist has the highest number of streams?

In [None]:
# The Weeknd

# Saving the dataset to the directory

In [None]:
!mkdir -p data/wk1
!curl -o ./data/wk1/spotify-2023.csv "https://storage.googleapis.com/qm2/wk1/spotify-2023.csv"

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  105k  100  105k    0     0   299k      0 --:--:-- --:--:-- --:--:--  300k


In [None]:
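import pandas as pd
from IPython.display import display

# Show every row and column instead of a truncated preview
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Load the Spotify dataset and inspect it
data = pd.read_csv('./data/wk1/spotify-2023.csv')
display(data)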

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,bpm,in_apple_playlists,in_apple_charts,in_deezer_playlists,in_deezer_charts,in_shazam_charts,streams,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,'Till I Collapse,"Eminem, Nate Dogg",2,2002,5,26,22923,0,1695712020,78,46,2515,1,0.0,171,C#,Major,55,10,85,7,0,8,20
1,(It Goes Like) Nanana - Edit,Peggy Gou,1,2023,6,15,2259,59,57876440,0,0,109,17,0.0,130,G,Minor,67,96,88,12,19,8,4
2,10 Things I Hate About You,Leah Kate,1,2022,3,23,1301,0,185550869,23,1,15,0,0.0,154,G#,Major,54,45,79,1,0,17,5
3,10:35,"Tiësto, Tate M",2,2022,11,1,4942,26,325592432,190,104,147,18,63.0,120,G#,Major,70,70,79,7,0,18,10
4,2 Be Loved (Am I Ready),Lizzo,1,2022,7,14,3682,6,247689123,41,0,158,2,68.0,156,G,Major,72,92,77,9,0,8,11
5,2055,Sleepy hallow,1,2021,4,14,2226,0,624515457,29,0,44,0,0.0,161,F#,Minor,78,65,52,46,0,12,31
6,212,"Mainstreet, Chefin",2,2022,1,15,352,0,143139338,10,0,39,0,0.0,154,D,Minor,79,86,52,66,0,9,7
7,25k jacket (feat. Lil Baby),"Gunna, Lil Baby",2,2022,1,7,620,0,54937991,17,3,3,0,0.0,115,F,Minor,90,74,54,16,0,13,28
8,295,Sidhu Moose Wala,1,2021,5,15,246,4,183273246,4,106,0,0,7.0,90,B,Minor,68,54,76,21,0,11,20
9,505,Arctic Monkeys,1,2007,4,20,13985,25,1217120710,30,80,588,1,1.0,140,,Major,52,20,85,0,0,7,5


# Swapping the Columns

In [None]:
data['streams'], data['bpm'] = data['bpm'], data['streams']
display(data)


Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,bpm,in_apple_playlists,in_apple_charts,in_deezer_playlists,in_deezer_charts,in_shazam_charts,streams,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,'Till I Collapse,"Eminem, Nate Dogg",2,2002,5,26,22923,0,171,78,46,2515,1,0.0,1695712020,C#,Major,55,10,85,7,0,8,20
1,(It Goes Like) Nanana - Edit,Peggy Gou,1,2023,6,15,2259,59,130,0,0,109,17,0.0,57876440,G,Minor,67,96,88,12,19,8,4
2,10 Things I Hate About You,Leah Kate,1,2022,3,23,1301,0,154,23,1,15,0,0.0,185550869,G#,Major,54,45,79,1,0,17,5
3,10:35,"Tiësto, Tate M",2,2022,11,1,4942,26,120,190,104,147,18,63.0,325592432,G#,Major,70,70,79,7,0,18,10
4,2 Be Loved (Am I Ready),Lizzo,1,2022,7,14,3682,6,156,41,0,158,2,68.0,247689123,G,Major,72,92,77,9,0,8,11
5,2055,Sleepy hallow,1,2021,4,14,2226,0,161,29,0,44,0,0.0,624515457,F#,Minor,78,65,52,46,0,12,31
6,212,"Mainstreet, Chefin",2,2022,1,15,352,0,154,10,0,39,0,0.0,143139338,D,Minor,79,86,52,66,0,9,7
7,25k jacket (feat. Lil Baby),"Gunna, Lil Baby",2,2022,1,7,620,0,115,17,3,3,0,0.0,54937991,F,Minor,90,74,54,16,0,13,28
8,295,Sidhu Moose Wala,1,2021,5,15,246,4,90,4,106,0,0,7.0,183273246,B,Minor,68,54,76,21,0,11,20
9,505,Arctic Monkeys,1,2007,4,20,13985,25,140,30,80,588,1,1.0,1217120710,,Major,52,20,85,0,0,7,5
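
Swapping the values works. Since the task says the column *names* were swapped, another option is to leave the values alone and swap the labels with `rename`. A sketch on a two-row stand-in for the real data (values copied from the first two rows above; `raw` and `fixed` are names made up for this example):

```python
import pandas as pd

# Stand-in with the mislabelled columns, as in the raw file
raw = pd.DataFrame({
    "track_name": ["'Till I Collapse", "(It Goes Like) Nanana - Edit"],
    "bpm": [1695712020, 57876440],
    "streams": [171, 130],
})

# Swap the two labels; the values stay where they are
fixed = raw.rename(columns={"bpm": "streams", "streams": "bpm"})
print(fixed.columns.tolist())  # ['track_name', 'streams', 'bpm']
print(fixed["bpm"].tolist())   # [171, 130]
```

The trade-off is that the columns now appear in a different order from the original file, which is harmless for analysis.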


# Songs in C sharp!

In [None]:
c_sharp_songs = data[data["key"]=="C#"]
display(c_sharp_songs)

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,bpm,in_apple_playlists,in_apple_charts,in_deezer_playlists,in_deezer_charts,in_shazam_charts,streams,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
0,'Till I Collapse,"Eminem, Nate Dogg",2,2002,5,26,22923,0,171,78,46,2515,1,0.0,1695712020,C#,Major,55,10,85,7,0,8,20
14,A Veces (feat. Feid),"Feid, Paulo Londra",2,2022,11,3,573,0,92,2,0,7,0,0.0,73513683,C#,Major,80,81,67,4,0,8,6
18,AMERICA HAS A PROBLEM (feat. Kendrick Lamar),"Kendrick Lamar, Beyoncé",2,2023,5,19,896,0,126,34,2,33,0,1.0,57089066,C#,Major,78,20,70,1,0,16,4
27,Afraid To Feel,LF System,1,2022,5,2,5898,5,128,129,55,128,0,101.0,244790012,C#,Major,58,68,91,2,0,27,11
31,Agosto,Bad Bunny,1,2022,5,6,897,0,115,6,20,8,0,0.0,246127838,C#,Minor,85,72,58,9,0,49,12
47,Andrea,"Buscabulla, Bad Bunny",2,2022,5,6,1195,0,103,8,30,13,1,1.0,344055883,C#,Minor,80,45,62,76,0,10,38
60,Area Codes,"Kaliii, Kaliii",2,2023,3,17,1197,13,155,44,34,25,1,171.0,113509496,C#,Major,82,51,39,2,0,9,49
67,BABY HELLO,"Rauw Alejandro, Bizarrap",2,2023,6,23,1004,35,130,42,80,58,3,169.0,54266102,C#,Minor,77,84,89,17,0,43,5
70,BILLIE EILISH.,Armani White,1,2022,1,20,2537,0,100,49,1,67,11,1.0,277132266,C#,Major,90,75,50,11,0,9,26
71,BREAK MY SOUL,Beyoncé,1,2022,6,21,9724,0,115,222,61,259,14,2.0,354614964,C#,Minor,70,87,88,4,0,26,8
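
The filter above works in two steps: `data["key"]=="C#"` produces a column of True/False values (a boolean mask), and indexing the dataframe with that mask keeps only the rows marked True. A sketch on a tiny made-up dataframe:

```python
import pandas as pd

# Made-up rows, just to show the mechanics
songs = pd.DataFrame({
    "track_name": ["A", "B", "C"],
    "key": ["C#", "G", "C#"],
})

mask = songs["key"] == "C#"  # one True/False per row
print(mask.tolist())         # [True, False, True]

c_sharp = songs[mask]        # keep only the rows where the mask is True
print(c_sharp["track_name"].tolist())  # ['A', 'C']
```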


# Songs in Descending Order

In [None]:
sorted_data = data.sort_values(by='streams', ascending=False)
display(sorted_data)

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,bpm,in_apple_playlists,in_apple_charts,in_deezer_playlists,in_deezer_charts,in_shazam_charts,streams,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
101,Blinding Lights,The Weeknd,1,2019,11,29,43899,69,171,672,199,3421,20,,3703895074,C#,Major,50,38,80,0,0,9,7
701,Shape of You,Ed Sheeran,1,2017,1,6,32181,10,96,33,0,6808,7,0.0,3562543890,C#,Minor,83,93,65,58,0,9,8
731,Someone You Loved,Lewis Capaldi,1,2018,11,8,17836,53,110,440,125,1800,0,,2887241814,C#,Major,50,45,41,75,0,11,3
185,Dance Monkey,Tones and I,1,2019,5,10,24529,0,98,533,167,3595,6,,2864791672,F#,Minor,82,54,59,69,0,18,10
761,Sunflower - Spider-Man: Into the Spider-Verse,"Post Malone, Swae Lee",2,2018,10,9,24094,78,90,372,117,843,4,69.0,2808096550,D,Major,76,91,50,54,0,7,5
569,One Dance,"Drake, WizKid, Kyla",3,2016,4,4,43257,24,104,433,107,3631,0,26.0,2713922350,C#,Major,77,36,63,1,0,36,5
670,STAY (with Justin Bieber),"Justin Bieber, The Kid Laroi",2,2021,7,9,17050,36,170,492,99,798,31,0.0,2665343922,C#,Major,59,48,76,4,0,10,5
88,Believer,Imagine Dragons,1,2017,1,31,18986,23,125,250,121,2969,10,31.0,2594040133,A#,Minor,77,74,78,4,0,23,11
154,Closer,"The Chainsmokers, Halsey",2,2016,5,31,28032,0,95,315,159,2179,0,44.0,2591224264,G#,Major,75,64,52,41,0,11,3
744,Starboy,"The Weeknd, Daft Punk",2,2016,9,21,29536,79,186,281,137,2445,1,140.0,2565529693,G,Major,68,49,59,16,0,13,28
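
Note that the cell above sorts the full dataset rather than the C sharp subset, which is why songs in other keys appear in the output. To follow steps 3 and 4 literally, the filter and the sort can be chained. A sketch on made-up numbers (the `key` and `streams` column names match the dataset; the rows and the `artist` column do not):

```python
import pandas as pd

songs = pd.DataFrame({
    "artist": ["W", "X", "Y", "Z"],
    "key": ["C#", "G", "C#", "D"],
    "streams": [300, 900, 500, 100],
})

# Step 3: keep C# only; step 4: sort what is left by streams, highest first
top_c_sharp = songs[songs["key"] == "C#"].sort_values(by="streams", ascending=False)
print(top_c_sharp["artist"].tolist())  # ['Y', 'W']
```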


# Most Streamed Song & Artist :)

In [None]:
most_streamed_song = data.sort_values(by='streams', ascending=False).head(1)
display(most_streamed_song)

Unnamed: 0,track_name,artist(s)_name,artist_count,released_year,released_month,released_day,in_spotify_playlists,in_spotify_charts,bpm,in_apple_playlists,in_apple_charts,in_deezer_playlists,in_deezer_charts,in_shazam_charts,streams,key,mode,danceability_%,valence_%,energy_%,acousticness_%,instrumentalness_%,liveness_%,speechiness_%
101,Blinding Lights,The Weeknd,1,2019,11,29,43899,69,171,672,199,3421,20,,3703895074,C#,Major,50,38,80,0,0,9,7
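
To print just the artist's name rather than the whole row, one option is `idxmax`, which returns the index label of the largest value in a column. A sketch on made-up rows (in the real dataset the artist column is called `artist(s)_name`):

```python
import pandas as pd

songs = pd.DataFrame({
    "artist": ["W", "X", "Y"],
    "streams": [300, 900, 500],
})

top_label = songs["streams"].idxmax()        # index label of the largest value
top_artist = songs.loc[top_label, "artist"]  # look up that row's artist
print(top_artist)  # X
```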
