# **Introduction**

## Learning Objectives

Here, we will aim to:
* Familiarise ourselves with the Kaggle notebook(s)
* Be able to understand integers, floats, Boolean values, strings
* Learn how to assign variables
* Explore lists within Python
* Look into *pandas*, Python's data manipulation and analysis library:
    * How to read data into Python
    * Discovering what information we can find out about dataframes.

## What is a notebook?

Kaggle notebooks work just like Jupyter notebooks (usually just called “notebooks”). Jupyter notebooks are one of the most commonly used IDEs (integrated development environments). A notebook consists of a sequence of cells, where each cell is formatted in either Markdown (for writing text) or in a programming language of your choice (for writing code).

Kaggle Notebooks may be created and edited via the Notebook editor. On larger screens, the Notebook editor consists of three parts:

* An editing window
* A console
* A settings window

The Notebook editor allows you to write and execute both traditional Scripts (code-only files, ideal for batch execution or Rmarkdown scripts) and Notebooks (an interactive code and Markdown editor, ideal for narrative analyses, visualizations, and sharing work).

The main difference between Scripts and Notebooks is the editing pane and how you experience editing and executing code.

### Editing
Whether you use Scripts or Notebooks might depend on your choice of language and what your use case is. R users tend to prefer the Scripts, while Python users prefer the Notebooks. For more on why that is, refer to the “Types of Notebooks” section. Scripts are also favored for making competition submissions where the code is the focus, whereas Notebooks are popular for sharing EDAs (exploratory data analysis), tutorials, and other share-worthy insights.

Both editing interfaces are organized around the concept of “Versions”. This is a collection consisting of a Notebook version, the output it generates, and the associated metadata about the environment.

In the Script editor, the code you write is executed all at once, whenever you generate a new version. For finer-grained control, it’s also possible to specifically execute only a single line or selection of lines of code.

Notebooks are built on Jupyter notebooks. They consist of individual cells, each of which may be a Markdown (text) cell or a code cell. Code can be run (and the resulting variables saved) by running individual code cells, and cells can be added or deleted from the notebook at any time.

### Console
The console tab provides an alternative interface to the same Python or R container running in the Notebook. Commands you input into the console will not change the content of your version. However, any variables you create in the console will persist throughout the session (unless you delete them). Additionally, any code that you execute in the editor will also execute in the console pane.

### Settings
In the expanded editor, the settings pane takes up the right side of the screen. In the compact editor (where you hide the settings pane), it is folded into tabs above the Editor tab. In either case the settings pane contains the following tabs:

* The “Data” tab provides a way of adding or removing data from the Notebook.
* The “Settings” tab has toggles for Language, Docker image selection, Internet (which is on by default), and the Accelerator (CPU by default, GPU, or TPU).


## ** ^Probably won't include this cell in the final notebook, will just walk them through it instead^ **

# Basic Python


## Numeric Values

We can use Python as a basic calculator by entering values into the cells. This doesn't store any of the information, other than in the notebook cell. Rather, you're just passing information to the Python interpreter and receiving an answer.

Python has two main numeric types:
* ints (integers, whole numbers) e.g. 6, 12, 100000, 8960005060030102
* floats (floating-point numbers, decimal numbers) e.g. 3.4, 7.8, 3.14, 98.99997

We can use mathematical expressions

* \+ for addition
* \- for subtraction
* \* for multiplication
* / for division
* \*\* for powers and roots
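
As a quick sketch of the operators listed above, the cell below applies each one to small example numbers (the values are arbitrary and chosen only for illustration). Note that `/` always produces a float, and that raising a number to the power `0.5` gives its square root.

```python
# Each operator applied to small example values
total = 7 + 2         # addition
difference = 7 - 2    # subtraction
product = 7 * 2       # multiplication
quotient = 7 / 2      # division always returns a float
power = 2 ** 3        # 2 to the power of 3
root = 9 ** 0.5       # a power of 0.5 gives the square root

print(total, difference, product, quotient, power, root)
# 9 5 14 3.5 8 3.0
```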

In [None]:
# Addition of ints (produces an int)
1+2

In [None]:
type(10+10) # the built-in type() function returns the type of its argument (object)

In [None]:
# Addition of Floats (produces a float)
1.0 + 2.0

## Logical Values

Logical values (or "Boolean values") are written as `True` and `False` in Python. (Note: remember to use capital letters!)

They are special Python data types - not strings! So they don't need ' ' or " " around them.

They also have numeric values behind them: `True` is 1 and `False` is 0.


In [None]:
# You can add them together

True + True

In [None]:
# In fact you can attempt any arithmetic operations with booleans
# This owes to their underlying numeric value of either 1 or 0.
True * 2.5

In [None]:
# Try testing True and True or False and False


In [None]:
# Try testing False and True
True and False
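
For reference, here is a small sketch of how `and` and `or` combine logical values, and of the numeric values behind them. The variable name `bool_total` is only used for illustration.

```python
# `and` is True only when both sides are True
print(True and True)    # True
print(True and False)   # False
print(False and False)  # False

# `or` is True when at least one side is True
print(True or False)    # True

# Booleans behave like the integers 1 and 0 in arithmetic
bool_total = True + True
print(bool_total)       # 2
```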

## Text Values
In Python, we refer to text as 'string' data because it is a sequence of characters strung together.

* Strings are ordered sequences of characters (letters, numbers, punctuation, whitespace and other special symbols).
* Strings are immutable.
* Use [ ] to access characters in a string
* Strings need to be enclosed in apostrophes (' ') or quotation marks (" "). It doesn't matter which, so long as you're consistent: you can't start with ' and end with "!
* 'hello world!' is the same as "hello world!"
* Strings can be manipulated with some arithmetic operators:
    * \+ concatenates strings (joins them together).
    * \* repeats a given string *n* times.

Strings are *objects* and so have a number of built-in *methods* for manipulating them.
* Use a '.' to access the string object's methods, e.g.
    * *string*`.lower()` transforms a string to all lowercase.
    * *string*`.upper()` transforms a string to all UPPERCASE.
    * *string*`.title()` transforms a string to Title Case.
* Some string methods require parameters, e.g.
    * *string*`.count('a')` counts the number of 'a' characters in the string, returning an integer.
    * *string*`.replace(" ","_")` replaces any spaces in the string with underscores.
    * *string*`.split(" ")` splits the string wherever a space is found, returning a list of the pieces.
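
The points above about indexing, immutability, `.title()` and `.split()` can be sketched as follows. The string `phrase` and the other variable names are made up for illustration.

```python
phrase = "data science campus"   # example string, chosen for illustration

first_char = phrase[0]            # indexing starts at 0
title_case = phrase.title()       # returns a new string in Title Case
words = phrase.split(" ")         # list of the pieces between spaces

print(first_char)    # d
print(title_case)    # Data Science Campus
print(words)         # ['data', 'science', 'campus']

# Strings are immutable: methods return new strings, the original is unchanged
print(phrase)        # data science campus
```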
    

In [None]:
# try entering a string with quotation marks ""


In [None]:
# now try with apostrophes ''


In [None]:
# We can use the + symbol here too - but it works differently!
"Data" + " Science " + "Campus"

In [None]:
# Multiplication too!
"ONS" * 5

In [None]:
#.lower()
'ONS'.lower()

In [None]:
#.count()
'ons'.count('o')

In [None]:
#.replace()
'ons'.replace('o','0')

Now you try!

## Exercise 1

1. Take 2 integers and multiply them. Print the result.
2. Find the type of your result from Question 1.
3. Using the string "I like learning to code"

    a) Uppercase the whole string
    
    b) Replace 'like' with 'love'
    
    c) Count the number of times the letter 'e' is used in the string.

## Answers

In [None]:
# print("1.",10*123)

# print("2.",type(10*123))

# string = "I like learning to code"
# print("3a.", string.upper())
# print("3b.", string.replace("like", "love"))
# print("3c. Number of e's in this string: ", string.count('e'))

## Variable Assignment


* Variables store data values under a specific name:
    * `name = 'Georgina'` - here the variable `name` is storing the string `'Georgina'`.
    * `age = 23` - here the variable `age` is storing the integer `23`.
    * `height = 1.60` - here the variable `height` is storing the float `1.60`.
* Variables have to be assigned. Assignment is done with the `=` (equals) sign.
* Once created, using a variable means getting the value of whatever it is storing.
    * Inputting `height` into Python will return `1.60` until the variable is either changed or deleted.
* We can change the value of variables at any point by assigning it a new value.
    * `height = 1.75` - reassigns the height variable to the value `1.75`
    * `height = "egg"` - reassigns the height variable to the value `"egg"`
    * `del height` - deletes the height variable, it is no longer in Python's memory.
* Variable names are case sensitive.
* Variable names can't start with a number (they must start with a letter or an underscore `_`).
* Variable names can't be reserved words like `True`, `class` or `yield`, which already mean something to Python.
* Variable names should be descriptive (ideally).
* Variable names can be any length, but should be sensible!
* Variable names can look like this:
    * tomjones (lowercase)
    * TOMJONES (UPPERCASE)
    * tomJones (camelCase)
    * TomJones (UpperCamelCase)
    * tom_jones (snake_case, can also be Tom_Jones)
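
A short sketch of reassignment, case sensitivity and `del`, using the `height` example from the list above. The `try`/`except` block is only there so the cell keeps running after the error and can show the message.

```python
height = 1.60
print(height)        # 1.6

height = 1.75        # reassigning replaces the old value
print(height)        # 1.75

Height = 2.0         # a different variable: names are case sensitive
print(height, Height)

del height           # the name `height` no longer exists
try:
    print(height)
except NameError as error:
    message = str(error)
    print("NameError:", message)
```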

In [None]:
name = 'Tom'
surname = "Jones"
print(name + " " + surname)

Remove the #s from the cell below and run the cell.

In [None]:
# name = input("Please enter your name")
# print ("Welcome to the Data Science Campus", name)

## Lists

Lists are used to store multiple items in one variable.

Lists are one of four built-in data types in Python used to store collections of data. The other three are tuples, sets, and dictionaries, each with different qualities and uses.

Lists are created using square brackets:

In [None]:
myList = ['red', 'black', 'blue', 'orange']
print(myList)

List items are ordered, changeable, and allow duplicate values.

List items are indexed, and the first item has index `[0]`, the second item has index `[1]` etc.

### Ordered

When we say that lists are ordered, we mean that the items in the list will have a defined order, and that order will not change. Adding new items to the list will place the new items at the end.

### Changeable

Lists are changeable, meaning we can change, add and remove items in a list after it has been created.

### Allow Duplicates

Since lists are indexed, lists can have items with the same value. For example:



In [None]:
myList = ['red', 'black', 'blue', 'orange', 'red']
print(myList)

### List Length

To find out how many items are in a list, we use the `len` function.

In [None]:
myList = ['red', 'black', 'blue', 'orange']
print(len(myList))

### Access Items 

Because list items are indexed, you can access them by referring to their index number. For example, if you wanted to print the third item in the list:

In [None]:
myList = ['red', 'black', 'blue', 'orange']
print(myList[2])

### Negative Indexing

When using negative indexing, we start from the end of the list. For example, if we wanted the last item in the list:

In [None]:
myList = ['red', 'black', 'blue', 'orange']
print(myList[-1])

### Changing item value

To change the value of a specific item, we refer to the index number. For example, if we wanted to change the second item in the list:


In [None]:
myList = ['red', 'black', 'blue', 'orange']
myList[1] = 'green'
print(myList)

### Insert items

To insert a new item into the list, without replacing any of the existing values, we can use the `insert()` method. So if we wanted to insert 'white' as the third item:

In [None]:
myList = ['red', 'black', 'blue', 'orange']
myList.insert(2, 'white')
print(myList)

### Adding items

To add items to the end of the list, we use the `append()` method. 

In [None]:
myList = ['red', 'black', 'blue', 'orange']
myList.append('purple')
print(myList)

### Removing items

The `remove()` method removes the item we want to take out of the list. For example, if we wanted to remove 'orange' from the list:


In [None]:
myList = ['red', 'black', 'blue', 'orange']
myList.remove('orange')
print(myList)

### Deleting and clearing a list

The `del` keyword can remove either a specified index or the entire list.

In [None]:
#deleting the first item in the list:
myList = ['red', 'black', 'blue', 'orange']
del myList[0]
print(myList)

In [None]:
#deleting the list completely:
myList = ['red', 'black', 'blue', 'orange']
del myList
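
After `del myList` the name itself is gone, so using it again raises a `NameError`. A minimal sketch of that behaviour, with the error caught so the message can be printed:

```python
myList = ['red', 'black', 'blue', 'orange']
del myList

# The variable no longer exists, so Python raises a NameError
try:
    print(myList)
except NameError as error:
    message = str(error)
    print("NameError:", message)
```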

The `clear()` method will empty the list. The list still exists, but has no content.

In [None]:
myList = ['red', 'black', 'blue', 'orange']
myList.clear()
print(myList)

### Joining two or more lists

There are several ways to join (concatenate) two or more lists in Python.
You could:

* use the `+` operator
* use the `extend()` method, which adds the elements from one list to another list
* append all the items from one list to another, one item at a time.

In [None]:
#using the + operator
list1 = ['one', 'two', 'three']
list2 = ['1', '2', '3']
list3 = list1+list2
print(list3)

In [None]:
#using the extend() method
list1 = ['one', 'two', 'three']
list2 = ['1', '2', '3']
list1.extend(list2)
print(list1)

In [None]:
#using the append() method
list1 = ['one', 'two', 'three']
list2 = ['1', '2', '3']

for x in list2:
    list1.append(x)

print(list1)
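
One difference between these approaches is worth seeing side by side: `+` builds a new list, while `extend()` changes the existing one. A small sketch, where `combined` is just an illustrative name:

```python
list1 = ['one', 'two', 'three']
list2 = ['1', '2', '3']

# `+` builds a brand new list and leaves list1 and list2 untouched
combined = list1 + list2
print(len(list1), len(combined))   # 3 6

# extend() changes list1 itself and returns None
list1.extend(list2)
print(len(list1))                  # 6
```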

## Exercise 2

1.  Below is a Python list that outlines a bus journey with stops. Print out stop 2 (place and time of arrival).
    
    ```python
        route1 = [['Glasgow', '09:00'],['Falkirk', '10:00'],['Edinburgh', '10:30']]
    ```


2.   Using the `len` function, return how many stops are in route 1.

3.   Roadworks in Edinburgh mean that the bus will actually arrive at the last stop half an hour later than planned. Change the last item in route 1 to reflect this. 

## Answers

In [None]:
# 1. 
#route1 = [['Glasgow', '09:00'],['Falkirk', '10:00'],['Edinburgh', '10:30']]
#print(route1[1])

#2.
#print(len(route1))

#3. 
#route1[2] = ['Edinburgh', '11:00']
#print(route1)

# Introduction to Pandas

Pandas is a Python library used for working with datasets. It has functions for analyzing, cleaning, exploring and manipulating data. Data Scientists use pandas in order to analyze big data and make conclusions based on statistical theories. Pandas can make messy datasets readable and relevant, something which is extremely important in data science.

Pandas can help us answer questions like:
* Is there a correlation between two columns?
* What is the average value of the data?
* What is the min/max value of the data?

As Pandas is a third party library, we must import it into the notebook we are working in, like this:

``` python
    import pandas as pd
```

### What is a dataframe?

A Pandas dataframe is a 2D data structure - a table with rows and columns. A simple dataframe would look like this:

In [None]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

#load data into a DataFrame object:
df = pd.DataFrame(data)

print(df) 
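
As a rough sketch of how pandas answers the questions listed earlier (average, min/max and correlation), the cell below uses the same small dataframe. The variable names are illustrative, and with only three rows the correlation is not statistically meaningful.

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

average_calories = df["calories"].mean()   # average value of a column
lowest_calories = df["calories"].min()     # minimum value
highest_calories = df["calories"].max()    # maximum value
correlation = df["calories"].corr(df["duration"])  # between -1 and 1

print(average_calories, lowest_calories, highest_calories)
print(correlation)
```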

### Locating a row

The above dataframe looks exactly like a simple table with rows and columns. In Pandas, we can use the `loc` attribute to return specific rows. For example, if we wanted to focus on the first row:

In [None]:
#refer to the row index:
print(df.loc[0])

In [None]:
#to focus on both the first and second row:
print(df.loc[[0, 1]])
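
The two calls above return different kinds of object, which explains why their printed output looks different. A minimal sketch using the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    "calories": [420, 380, 390],
    "duration": [50, 40, 45],
})

one_row = df.loc[0]         # a single label returns a Series
two_rows = df.loc[[0, 1]]   # a list of labels returns a DataFrame

print(type(one_row).__name__)    # Series
print(type(two_rows).__name__)   # DataFrame
print(one_row["calories"])       # 420
```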

### Naming Indexes

With the `index` argument, we can name our own indexes (i.e. give each row a name). For example:

In [None]:
import pandas as pd

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data, index = ["day1", "day2", "day3"])

print(df) 

### Locating named indexes

We can then use the named index in the `loc` attribute to return a specified row. For example, if we just wanted to look at Day 3:

In [None]:
#refer to the named index:
print(df.loc['day3'])

### Reading in CSV files

One of the simplest and most common ways to store big datasets is to use CSV files (CSV stands for comma-separated values).

CSV files contain plain text and are a well-known, easily read format.

In this example, we will be using a CSV file called 'data.csv'. You can download the file [here](https://www.w3schools.com/python/pandas/data.csv).

We're now going to load in the CSV to a dataframe.

In [None]:
import pandas as pd

df = pd.read_csv('../input/w3dirtydata/dirtydata.csv')

#print the entire dataframe
print(df.to_string()) 
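
To see what `read_csv` does without needing a file, here is a small sketch that reads CSV text held in a string. The CSV content and the name `example_df` are made up for illustration and are not taken from the course dataset.

```python
import io
import pandas as pd

# Made-up CSV text for illustration only; not taken from the course dataset
csv_text = """Duration,Pulse
60,110
45,117
"""

# read_csv accepts a file path or a file-like object such as StringIO
example_df = pd.read_csv(io.StringIO(csv_text))

print(example_df)
print(example_df.shape)   # (rows, columns)
```

The first line of the text becomes the column headers, and each following line becomes one row.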

### Viewing the Data

By default, `print(df)` on a large dataframe shows only the first 5 rows and the last 5 rows. We can use the `head()` method to see the headers and a specified number of rows, starting from the top of the dataframe.

For example, if we wanted to view the first 10 rows of our dataframe:

In [None]:
import pandas as pd

df = pd.read_csv('../input/w3dirtydata/dirtydata.csv')

print(df.head(10))

The opposite to the `head()` method is the `tail()` method, where you can view the *last* rows of the dataframe. To view the last 10 rows:

In [None]:
print(df.tail(10))

### Finding information about the data

We can use the `info()` method to give us more information about our dataset.

In [None]:
# info() prints its summary directly and returns None, so no print() is needed
df.info()

For the `data.csv` file linked above, the output shows 169 rows and 4 columns, named `Duration`, `Pulse`, `Maxpulse`, and `Calories`. It also lists how many non-null values each column holds and each column's data type.

### Null values

Null values, or empty values, are usually removed before any data is analyzed. Removing rows that contain empty values is one step of cleaning the data ready for analysis.
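
Before removing anything, it helps to count how many empty cells each column has. A minimal sketch, assuming a small made-up dataframe (`example_df` and its values are illustrative only):

```python
import pandas as pd

# Illustrative values only; None marks an empty cell
example_df = pd.DataFrame({
    "Duration": [60, 45, 30],
    "Calories": [409.1, None, 300.0],
})

# isna() marks each empty cell as True; sum() counts them per column
missing_per_column = example_df.isna().sum()
print(missing_per_column)

# dropna() returns a new dataframe without the rows that have empty cells
cleaned = example_df.dropna()
print(len(example_df), len(cleaned))   # 3 2
```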

### Data Cleaning

When we talk about data cleaning, we are aiming to fix 'bad data' in the dataset.

Bad data could be:
* empty cells
* data in the wrong format
* wrong data
* duplicates

Now that we know this, we'll take a look at a dataset which can be downloaded from [here](https://www.w3schools.com/python/pandas/dirtydata.csv) if you'd like to see it:

In [None]:
import pandas as pd

df = pd.read_csv('../input/w3dirtydata/dirtydata.csv')

#view top 30 rows
print(df.head(30))


The dataset contains some empty cells ("Date" in row 22, and "Calories" in rows 18 and 28).

The dataset contains a value in the wrong format ("Date" in row 26).

The dataset contains wrong data ("Duration" in row 7).

The dataset contains duplicates (rows 11 and 12).

### Removing rows

We will deal with the empty cells by removing any rows that contain empty cells. This is usually okay as removing a couple of rows will not have a big impact.

To create a new dataframe with no empty cells:

In [None]:
new_df = df.dropna()

print(new_df.to_string())

Note: we haven't changed the original dataframe, only created a new one. To change the original, we can use the `inplace = True` argument:

In [None]:
#remove all rows containing null values from the original dataframe
df.dropna(inplace = True)

print(df.to_string())

### Replacing empty values

Instead of removing the rows that contain empty values, we could insert a new value instead. We can use the `fillna()` method for this. So if we wanted to replace all null values with the number 130:

In [None]:
df = pd.read_csv('../input/w3dirtydata/dirtydata.csv')

#replace all empty cells in the whole dataframe:
df.fillna(130, inplace = True)

In [None]:
df = pd.read_csv('../input/w3dirtydata/dirtydata.csv')

#to replace empty values in one column only, specify the column (in this case "Calories"):
df["Calories"] = df["Calories"].fillna(130)
print(df)

### Converting data into the correct format

If a cell has data in the wrong format, it can be very difficult to analyze. 

We can either remove the rows or convert all the cells in the columns to the same format.

In our dataframe, two cells in the `Date` column need attention: row 22 is empty, and row 26 is written in a different format from the other rows. Every cell in this column should hold a value that represents a date.

We can deal with this by first converting all cells in the `Date` column into dates, using the `pd.to_datetime()` function.

In [None]:
# Assumption: on pandas 2.0 or later, mixed date formats may need format='mixed'
df['Date'] = pd.to_datetime(df['Date'])

print(df.to_string())

So row 26 has been fixed, but row 22 still contains a null value. We will remove this row.

In [None]:
#remove rows with null value in `Date` column:
df.dropna(subset=['Date'], inplace = True)
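
The two steps above (convert to dates, then drop the rows where the date is missing) can be sketched on a small made-up dataframe. The values and the name `example_df` are illustrative only, and the dates here all share one format to keep the example simple.

```python
import pandas as pd

# Illustrative values only; None marks an empty cell
example_df = pd.DataFrame({
    "Date": ["2020/12/01", None, "2020/12/03"],
    "Duration": [60, 45, 30],
})

# Empty cells become NaT ("not a time") after conversion
example_df["Date"] = pd.to_datetime(example_df["Date"])
print(example_df["Date"].isna().sum())   # 1

# Drop only the rows whose Date is missing
example_df = example_df.dropna(subset=["Date"])
print(len(example_df))                   # 2
```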

### Wrong data

Wrong data doesn't necessarily have to be data in the wrong format or empty cells. It could be data where there is a mistake in data entry (for example, if someone entered 1000 instead of 100).

How do we tell if data is wrong? Sometimes the easiest way is just to look at the dataset. In our dataset, we have a value of 450 in the `Duration` column (row 7). This might not be wrong, but as we are talking about the duration of workouts, it's safe to assume the person did not work out for 7.5 hours! So, we can fix it by replacing the value. The error was probably due to a typo, so let's change 450 to 45.

In [None]:
#set duration = 45 in row 7:
df.loc[7, 'Duration'] = 45

print(df)
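
On a larger dataset, looking through every row by eye is not practical. One possible approach is to filter for values above a threshold and inspect those rows. In this sketch the 120-minute threshold, the example values and the name `example_df` are all assumptions made for illustration.

```python
import pandas as pd

# Illustrative values only
example_df = pd.DataFrame({"Duration": [60, 450, 45]})

# Assumed threshold: treat anything over 120 minutes as suspicious
suspicious = example_df[example_df["Duration"] > 120]
print(suspicious)

# Correct the value once it has been checked by hand
example_df.loc[suspicious.index, "Duration"] = 45
print(example_df["Duration"].tolist())   # [60, 45, 45]
```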

### Duplicate rows

Duplicate rows are rows that have been entered more than once. In our dataset, we can see rows 11 and 12 are identical. 

To remove duplicates, we use the `drop_duplicates()` method.

In [None]:
#remove all duplicates
df.drop_duplicates(inplace = True)
print(df)
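
To check which rows pandas considers duplicates before dropping them, the `duplicated()` method can be used. A minimal sketch on made-up data (`example_df` and its values are illustrative only):

```python
import pandas as pd

# Illustrative values only; the last two rows are identical
example_df = pd.DataFrame({
    "Duration": [60, 45, 45],
    "Pulse": [110, 117, 117],
})

# duplicated() marks every repeat of an earlier row as True
print(example_df.duplicated().tolist())   # [False, False, True]

deduplicated = example_df.drop_duplicates()
print(len(deduplicated))                  # 2
```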

And just like that, the dataset is clean and ready for analysis! Now we can use what we've learned in the practice task, which you can find [here](https://www.kaggle.com/miachatton/eyf-practice-task/).