# Part 1: The Basics #

**Import some useful packages:**<br>
pandas (pd) is used to read and use data in tables (*DataFrames*)<br>
matplotlib (plt) is used to make graphs (*Plots*)<br>
numpy (np) is used to work with arrays of numbers and to generate random numbers

Usually we import all of the packages we need at the top of the page, so we can easily keep track of them

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

**Top Tip #1** <br>
Printing your output

In [None]:
# Use the hash (#) to add comments to your code
print('This is how you print some text')

In [None]:
print("""
----------------
I
Want
'Lots'
of
Lines
----------------
""") # if you want to include new lines or other punctuation within your text, use triple quotes (""")

In [None]:
print(10) # you can print text (called a String) or numbers
print('ten')

**Top Tip #2**<br>
Some quick maths

In [None]:
# Python already has some basic maths functions
print(9+10)
print(4*4)

**Top Tip #3**<br>
Variables and data types

- In Python, you can assign a value to a name using '='<br>
- Once it has a value, it is called a variable and can be used by any other code after that point<br>
- You can call your variables almost anything you want: names can contain letters, numbers and underscores, but no spaces, and they cannot start with a number. It is good practice to give them informative names so other people can use your code and understand what is going on
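
As a small sketch of these naming rules, the example below assigns a few variables and then uses the built-in string function `isidentifier()` to check whether a piece of text would be allowed as a variable name. The names used here (`apples`, `pears`, `total_fruit`) are made up for illustration.

```python
# Assign values to two variables, then reuse them in later code
apples = 3
pears = 4
total_fruit = apples + pears
print(total_fruit)

# isidentifier() checks whether some text is a valid variable name
print("my_favourite_number".isidentifier())   # letters and underscores are fine
print("my favourite number".isidentifier())   # spaces are not allowed
print("2nd_number".isidentifier())            # names cannot start with a number
```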

In [None]:
# Try assigning a value to something
katies_favourite_number = 13
my_favourite_number = ___ #ADD YOUR NUMBER HERE

In [None]:
# once you have assigned a value to something, you can perform operations like print() on it
print(my_favourite_number)

In [None]:
# ADVANCED

# A quick way to print text and values together using f in front of the String, and writing in values using curly brackets {}
print(f"Katie's favourite number is: {katies_favourite_number}")

# Try writing your own sentence here, and include your favourite number as part of the string
print(f"___{___}")

Everything in python has a data type, which defines what it is and what we can do with it<br>
Some common data types in python are:<br>
 - Integer (int): whole numbers such as 1, 100, 200
 - Float (float): numbers with a decimal point such as 1.5, 4.99
 - String (str): a sequence of text characters such as 'hello', 'green'
 - List (list): a sequence of objects (each object can be a different type) such as [1,2,3], [5, 'green', 4.5]
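
To see these data types in action, here is a minimal sketch which loops over one example value of each type and asks python what it is. The `for` loop is not covered in this notebook - it is only used here to avoid repeating the same line four times.

```python
# One example value for each of the common data types
examples = [1, 4.99, 'green', [5, 'green', 4.5]]

for value in examples:
    # type() returns the data type; __name__ gives its short name
    print(value, 'is of type', type(value).__name__)

# Collect the short names so we can look at them all together
type_names = [type(value).__name__ for value in examples]
print(type_names)
```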

In [None]:
# let's see what data type your favourite number is
type(my_favourite_number)

In [None]:
# because it is a number, we can do some maths functions on it
my_favourite_number + katies_favourite_number

**Top Tip #4**<br>
Errors

Getting errors when you try to run your code is nothing to be afraid of!<br>
Learning how to interpret these errors is a useful skill

In [None]:
# Let's see what happens when we try to add a number (int) to a word (string)
4 + 'green'

At the top, python will tell you what kind of error you have created - here we have a TypeError

Then, you will get an arrow to tell you what line of code has caused your error - here it is line 2

Finally, you will see a 'useful' message about what has caused your error - here we can see that the '+' operator does not support the types 'int' and 'str'. This means that python doesn't know how to add a number to a string
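
One possible way to fix this kind of error is to convert one of the values so that both sides have the same type. The sketch below uses `try` and `except` (not covered in this notebook) to catch the TypeError instead of stopping, then uses `str()` to turn the number into text. The variable names are made up for illustration.

```python
number = 4
word = 'green'

try:
    result = number + word      # this raises a TypeError
except TypeError as error:
    print('Python raised a TypeError:', error)
    # str() turns the number into text, so '+' can join the two strings
    result = str(number) + word

print(result)
```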

In [None]:
# Let's see what error message we get if we miss out our brackets
print('hello'

Here we can see that we have created a 'SyntaxError'<br>
This means that the way we have structured or written our code is the problem - in older versions of python the message is 'unexpected EOF while parsing', while python 3.10 and newer instead tells you that the bracket '(' was never closed<br>
EOF stands for 'End of File', which means that python reached the end of your code while it was still expecting something else (in this case a closing bracket)
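
If you want to check a piece of code for syntax errors without running it, one option is the built-in `compile()` function. This is only a sketch for illustration: `broken_code` and `fixed_code` are made-up names, and the exact wording of the error message depends on your version of python.

```python
broken_code = "print('hello'"   # the closing bracket is missing on purpose

try:
    # compile() checks the syntax of the code without running it
    compile(broken_code, '<example>', 'exec')
    error_type = None
except SyntaxError as error:
    error_type = type(error).__name__
    print(error_type, '-', error.msg)

# With the closing bracket added, the same check passes without an error
fixed_code = "print('hello')"
compile(fixed_code, '<example>', 'exec')
```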

# Part 2: It's Panda time
![image.png](attachment:image.png)

- Pandas is a library which allows us to use data in tables (called **DataFrames**)
- We imported pandas (using the alias pd) at the top of this notebook, so now we can use any of the pandas functions by using 'pd.' followed by the function name. Using 'pd.' just tells python to look in the pandas library for that function
- We can add data into a pandas DataFrame by loading in a csv (comma separated values) file. A csv is a text file containing a table of data - for example you can save a sheet from an Excel document as a csv.
- Once we have created a DataFrame, we can apply functions to it using a dot (**.**) after the dataframe name, then giving the function name (and inputs)
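
To make the idea of a csv more concrete, here is a small sketch which builds a DataFrame from a few lines of made-up text rather than a real file. The table contents and the name `example` are invented for illustration and are not taken from the titles dataset.

```python
import io
import pandas as pd

# A tiny made-up csv: the first line holds the column names
csv_text = """title,type,runtime
Film A,MOVIE,100
Show B,SHOW,45
Film C,MOVIE,120
"""

# io.StringIO lets pd.read_csv() read the text as if it were a csv file
example = pd.read_csv(io.StringIO(csv_text))

print(example.head())    # apply a function to the DataFrame using a dot
print(example.shape)     # (number of rows, number of columns)
```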

In [None]:
# First, let's create a dataframe using the 'titles.csv' dataset
# We use pd.read_csv(): a function to read a comma-separated values (csv) file into a DataFrame.
titles = pd.read_csv('titles.csv')

In [None]:
# Now we have loaded our DataFrame called 'titles', we can apply functions to it using titles.function_name()
# Let's use the head() function to look at the first few rows
titles.head()

In [None]:
# If you want to select a single column use square brackets
titles['runtime']

In [None]:
# to select several columns, use a list of columns inside the square brackets
# here our list is ['title', 'runtime'] so we end up with double square brackets
titles[['title', 'runtime']]

In [None]:
# to select a single row, you can use its position in the table with iloc (short for 'integer location')
# in python, indexing starts from zero so the first row will be number 0, the second row is number 1 etc.
titles.iloc[0]

In [None]:
# We can use the describe() function to quickly generate some summary statistics for all of the numeric columns
titles.describe()

**QUIZ**<br>
1. In what year was the oldest title released?
2. What is the average (mean) imdb score in the dataset?
3. How many titles have a release year? How many have an imdb score? Why might these be different?

As we saw when we looked at the first few rows of our dataframe, some values are actually 'NaN'<br>
This stands for 'Not a Number' and it acts as a 'NULL' value where data is missing
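
Here is a minimal sketch of how NaN behaves, using a small made-up table (the titles and scores are invented for illustration). Note that pandas skips NaN values by default when it calculates statistics such as the mean.

```python
import numpy as np
import pandas as pd

# A small made-up table with two missing scores
scores = pd.DataFrame({
    'title': ['Film A', 'Show B', 'Film C'],
    'imdb_score': [7.5, np.nan, np.nan],
})

# isnull() gives True wherever a value is missing
missing_mask = scores['imdb_score'].isnull()
print(missing_mask)

# The mean only uses the values which are actually present
mean_score = scores['imdb_score'].mean()
print(mean_score)
```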

In [None]:
# We can look at how many values are populated in each column using count()
titles.count()

In [None]:
# On the other hand, we can find missing values using isnull() and sum() to add up the counts
titles.isnull().sum()

**QUIZ**<br>
1. Which columns have **no** missing values?
2. Which columns are missing the most values?

There are lots of different ways to deal with missing values in a dataset, depending on what kind of analysis we want to do:
- We could remove all rows which have missing values
- We could remove all columns which have missing values 
- We could fill them in using the average or most common value in that column

Try to think of some advantages and disadvantages to each of these approaches
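
The three approaches above could look something like this in pandas. This is only a sketch on a small made-up table (the data and variable names are invented for illustration), so you can compare what each approach keeps and what it throws away.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'title': ['Film A', 'Show B', 'Film C'],
    'runtime': [100, 45, 120],
    'imdb_score': [7.0, np.nan, 8.0],
})

# Option 1: remove every row which has a missing value
rows_removed = data.dropna()

# Option 2: remove every column which has a missing value
columns_removed = data.dropna(axis=1)

# Option 3: fill the gap with the average (mean) of that column
filled = data.copy()
filled['imdb_score'] = filled['imdb_score'].fillna(filled['imdb_score'].mean())

print(rows_removed.shape, columns_removed.shape)
print(filled)
```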

In [None]:
# Let's remove some columns which have lots of missing values
titles = titles.drop(['age_certification', 'seasons'], axis=1)

In [None]:
# Use count() again to see how that has affected the dataframe
___

In [None]:
# We still have some missing values, so let's now just remove any rows which contain a missing value using dropna()
titles = titles.dropna()

In [None]:
# Use count() again to check whether all columns have the same number of values now
___

In [None]:
# Use describe() again to see how our data cleaning has affected the summary statistics
titles.describe()

**QUIZ**<br>
1. In what year was the oldest title released?
2. What is the average (mean) imdb score in the dataset?


# Part 3: Graphs
Matplotlib is a 'library' which provides us with lots of useful functions for data visualisation<br>
We imported it at the top of this notebook, so we can start using it on whatever data we want<br>

You can check out the documentation here: https://matplotlib.org/stable/tutorials/pyplot.html

- Whenever we want to use a function from matplotlib, we have to write **plt.** before the function name, so python knows where to look
- Then we tell it the function name
- Finally, we have to provide some inputs in brackets after the function name. If the function does not require any inputs then we still use brackets, but they will be empty: ()

In [None]:
# first, let's define some data
# for our first graph, we will just use two lists of numbers for our X and Y values
x = [1,2,3,4]
y = [2,4,6,8]

In [None]:
# now we can create a line chart using the plot() function
# - this is a great way to visualise the relationship between two variables
# x and y are the inputs to the function, so we include them inside the brackets
plt.plot(x, y)
# plt.show() is a bit like the 'print' function, but for graphs
plt.show()

In [None]:
# Now let's add some labels to our graph
plt.plot(x, y)
plt.xlabel('X Values')
plt.ylabel('Y Values')

plt.show()

We can create different types of chart depending on the type of data we want to visualise<br>
For example, we can use a histogram to visualise the distribution of a single variable
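
A histogram works by splitting the range of values into equal-width 'bins' and counting how many values fall into each one. As a rough sketch of that counting step, the example below uses numpy's `histogram()` function with 10 bins (the default number of bins in `plt.hist()`). The random data and the seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seeded so the example is repeatable
data = rng.normal(size=1000)

# np.histogram() counts how many values fall into each bin
counts, bin_edges = np.histogram(data, bins=10)

print(counts)       # the height of each bar
print(bin_edges)    # where each bin starts and ends (one more edge than bins)
```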

In [None]:
# first let's define a random list of 1000 numbers as our data (using a normal distribution)
x = np.random.normal(size=1000)

In [None]:
# we can use plt.hist() to create a histogram
plt.hist(x)
plt.show()

**Now we can use this to visualise our titles data**

Let's take a look at the distribution of runtimes in our titles data

In [None]:
plt.hist(titles['runtime'])
plt.show()

**Challenge**<br>
1. Can you add some labels to our graph?<br>
    Hint - use plt.xlabel and plt.title to add some useful information
2. What kind of distribution does this look like?<br>
    Is it a normal distribution?<br>
    Does it have a single peak (average value)?

We know our dataset contains both Movies and TV shows, so maybe that is the reason for the distribution we can see<br>
Let's see what happens if we split out Movies and TV shows into separate dataframes
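
Splitting a dataframe like this relies on a comparison which gives True or False for every row. Here is a small sketch of that idea on a made-up table (the titles and variable names are invented for illustration): only the rows marked True are kept.

```python
import pandas as pd

example = pd.DataFrame({
    'title': ['Film A', 'Show B', 'Film C'],
    'type': ['MOVIE', 'SHOW', 'MOVIE'],
})

# The comparison gives one True/False value per row
is_movie = example['type'] == 'MOVIE'
print(is_movie)

# loc keeps only the rows where the value is True
only_movies = example.loc[is_movie]
print(only_movies)
```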

In [None]:
# We use the loc function to look for rows where the 'type' column is equal (==) to 'MOVIE'
movies = titles.loc[titles['type'] == 'MOVIE']

**Challenge**<br>
Try creating a dataframe called shows which contains only rows where the 'type' column is equal to 'SHOW'

In [None]:
shows = titles.loc[titles['type'] == 'SHOW']

Now we have a dataframe containing only movies, let's look at the distribution of movie runtimes<br>
In order to select a single column from your dataframe, we use square brackets

In [None]:
plt.hist(movies['runtime'])
plt.xlabel('Runtime (minutes)')
plt.title('Movie Runtimes')
plt.show()

**Challenge**<br>
Try visualising the distribution of show runtimes<br>
How does this differ from the movie runtimes?

In [None]:
plt.hist(___[___])
plt.xlabel(___)
plt.title(___)
plt.show()

**Challenge**<br>
1. What does the distribution look like now?
2. Can you plot the distributions of other columns in the dataset?

# Part 4 (Advanced): Some actual data science???

You may be thinking 'with all my new SQL and Tableau skills, what do I need python for?'<br>
Everything we have done so far would also be possible using these other skills (or even Excel!)<br>

The real power of Python comes when we want to do more complex things with our data, such as building models and predicting outcomes<br>
For this part of the tutorial, we will be using the 'iris' dataset which contains some flower measurements (in cm) and their species
![image.png](attachment:image.png)

In [None]:
iris = pd.read_csv('iris.csv')
iris.head()

In [None]:
# use describe() to see some summary statistics and find any missing values
___

In [None]:
# use unique() to find out which different species are present
iris['species'].___

In [None]:
# let's visualise some of our data to see if there is any relationship between the different measurements
plt.scatter(iris['petal_length'], iris['petal_width'])
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.show()

**Challenge**<br>
Is there a similar relationship between sepal length and sepal width?

In [None]:
plt.scatter(___ , ___)
plt.xlabel('Sepal length')
plt.ylabel('Sepal width')
plt.show()

It looks like there is a pretty strong relationship between petal length and petal width - <br>
Maybe we can use some data science to predict the width of a petal based on its length!<br>

For this we will be using one of the most common data science libraries: scikit-learn

In [None]:
# let's import a linear regression model from sklearn
from sklearn.linear_model import LinearRegression

In [None]:
# first let's define our predictor (X) and target (y) variables
# we use .values.reshape(-1, 1) to get the data in the right format for the model to use
X = iris['petal_length'].values.reshape(-1,1)
y = iris['petal_width'].values.reshape(-1,1)

In [None]:
# now we can define our model
linear_model = LinearRegression()

In [None]:
# fit the model to X and y
linear_model.fit(X, y)

In [None]:
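# now we have a model, we can use it to predict petal width based on petal length
# predict() expects a 2D input (rows of samples, columns of features), so we wrap the value in double square brackets
linear_model.predict([[5]])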

**Challenge**<br>
What is the predicted petal width of a flower which has 1cm long petals?

In [None]:
linear_model.predict([[___]])

We can use the predict() function on several inputs at once to create an array of predictions
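
To show the input format that the model expects, here is a minimal sketch using made-up numbers which follow the straight line y = 2x + 1 exactly (this is not the iris data, and the variable names are invented for illustration). Each row of the input is one flower and each column is one measurement, which is why we need `reshape(-1, 1)` and the double square brackets.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data which follows the straight line y = 2x + 1 exactly
lengths = np.array([1, 2, 3, 4])
widths = 2 * lengths + 1

# reshape(-1, 1) turns a flat array of 4 numbers into 4 rows and 1 column
X_example = lengths.reshape(-1, 1)
print(lengths.shape, '->', X_example.shape)

example_model = LinearRegression()
example_model.fit(X_example, widths)

# Each inner list is one input, so this asks for two predictions at once
new_predictions = example_model.predict([[5], [6]])
print(new_predictions)
```

Because the made-up data sits exactly on a straight line, the predictions land on that same line. Real data such as the iris measurements will be more scattered.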

In [None]:
predictions = linear_model.predict(X)

In [None]:
# plotting our predictions against the actual values
plt.plot(X, predictions, color='red', label = 'Predicted Width')
plt.scatter(X, y, color='green', label = 'Actual Width')
plt.xlabel('Petal length')
plt.ylabel('Petal width')
plt.legend()
plt.show()

**CONGRATULATIONS**<br>
You have created your first data science model!<br>
You can read more about the different models and data science functions at https://scikit-learn.org/stable/index.html

**Challenge**<br>
1. Try creating a new model to predict sepal width based on sepal length
2. Check out the sklearn documentation to create a different type of model to predict the species based on flower measurements
3. Use train and test datasets (train_test_split) to evaluate your model performance