# Intro to Python

This notebook covers some introductory Python concepts for CCC colleagues.

### Interacting with the notebook environment

The term "notebook" refers to an environment that contains text and instructions (like the contents of this cell), while also allowing the user to dynamically write and execute code in other cells. Let's show how to do that.

In [None]:
# This is a code cell
# This line is not code - we can use a hash symbol (#) at the start of a line to write annotations and comments

# The next line is code - we can see what it produces when we run this cell!

print("Hello world!")

### Variables

In Python we can define variables by assigning values to names with the "=" operator. We can then refer to these variables later.

In [None]:
# defining a set of variables

instructor = "Fergal"
instructor_age = 26

students = ["Luke", "Eoin", "Luke", "Miranda", "Jamie", "Sam"]

In [None]:
# if we call on one of these

instructor_age

In [None]:
# We can overwrite variables

instructor_age = 27

instructor_age

### Data types

Python has defined data types, which come with a set of properties and in-built functionality. The variables that we defined previously are of different types. Let's investigate.

In [None]:
# using type

type(instructor)

Our instructor variable is of type "str". This is short for "string", the term Python uses for text. We can do lots of things with strings, including taking smaller pieces of them, joining them together, capitalising them and so on. Let's explore some of these.

In [None]:
# defining a new variable to show how we can take smaller bits of a string

instructor_nickname = instructor[:4] # taking the first four letters of the instructor's name using slicing

instructor_nickname

In [None]:
# and another variable to show how we can add them together using the + operator

instructor_nickname_annoying = instructor + "icious"

instructor_nickname_annoying

Let's look at our other two variables, starting with the instructor_age variable.

In [None]:
# using type functionality again 

type(instructor_age)

Our instructor_age variable is an "int", which is short for integer. This is one of the two main numerical data types; the other is "float", which is used for decimal numbers. As you would expect, we can do standard mathematical operations with this data type.

In [None]:
# adding to this number

new_instructor_age = instructor_age + 1

# doing subtraction and addition

instructor_age_by_nz = (2050 - 2025) + instructor_age

# dividing by another number to get the ratio of years

ratio = instructor_age_by_nz / instructor_age

#TODO: Add more to this section
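Because the ratio above comes from a division, it is a float rather than an int. The short sketch below illustrates this, along with floor division and the remainder operator. The variable names and values here are made up for illustration and are not part of the cells above.

```python
# Illustrative values only (assumptions, not taken from the cells above)
age = 27
years_to_add = 25

total = age + years_to_add     # int + int gives an int
ratio = total / age            # "/" always gives a float
whole_part = total // age      # "//" is floor division: it discards the fractional part
remainder = total % age        # "%" gives the remainder after division

print(type(total), type(ratio))
print(whole_part, remainder)
print(round(ratio, 2))         # rounding a float to 2 decimal places
```

If a whole number is needed after a division, `//` or `round` is usually the simplest route.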


Finally, looking at our students variable.

In [None]:
type(students)

This is a "list". A list is an ordered collection of items, which can be of any data type and can include duplicates (like "Luke" above). We can do lots of operations on lists too.

In [None]:
# if we want to accept a new student

students.append("Thomas")

print("New student accepted:")
print(students)

# if we want to remove a particularly annoying student

students.remove("Jamie")

print("Annoying student removed:")
print(students)
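A few other common list operations are sketched below: counting items, picking items out by position, and slicing. The list `names` is a fresh example (an assumption for illustration) so that the cell can be run on its own. Note that `.remove` only removes the first matching item.

```python
# A fresh example list so this cell can be run on its own
names = ["Luke", "Eoin", "Luke", "Miranda"]

print(len(names))      # number of items in the list
print(names[0])        # indexing starts at 0, so this is the first item
print(names[-1])       # negative indices count back from the end
print(names[1:3])      # slicing returns a new list with items 1 and 2

names.remove("Luke")   # removes only the first "Luke"
print(names)
print(names.count("Luke"))  # one "Luke" is still in the list
```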

### Control flow

Control flow is a key concept in programming languages. It refers to the set of instructions we can use to direct the execution of a program. There are two main types of control flow statements:
* Conditional statements: These use the keywords "if", "elif", "else"
* Loop statements: These use the keywords "for", "while"

Let's look at these in more detail.

In [None]:
# conditional logic

if instructor == "Fergal":
    print("That's the right person to be running this training.")
elif instructor == "Jamie":
    print("Oh god please no!")
else:
    print("At least the instructor isn't Jamie!")

In [None]:
# loop logic

for student in students:
    print("The student's name is below:")
    print(student)
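The "for" loop above runs once per item in a list. A "while" loop instead keeps running for as long as a condition is true. The sketch below uses made-up values (`age` and `years_passed` are assumptions for illustration).

```python
# Illustrative starting values (assumptions, not from the cells above)
age = 27
years_passed = 0

# keep looping for as long as the condition is true
while age < 30:
    age = age + 1
    years_passed = years_passed + 1
    print(f"After {years_passed} year(s) the age is {age}.")

print("Loop finished.")
```

The condition has to become false eventually (here, because `age` increases on every pass), otherwise the loop never ends.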

### Functions

Functions are a key concept in programming languages in general. They are used to define a set of instructions that we can reuse elsewhere. In Python they are written using a specific syntax, which is shown below.

In [None]:
# defining an age_addition function which takes two inputs (known as "arguments") and adds them together

def age_addition(current_age, age_to_add):

    new_age = current_age + age_to_add

    return new_age

Some things to note about this syntax:
1) We use the key term "def" to let Python know that what follows is a function.
2) The function name is followed by brackets.
3) Within these brackets are two additional terms: these are known as function "arguments".*
4) We have a ":" after this.
5) The code which defines the function's operations is indented relative to the first line.
6) There is a "return" key word at the end. Whatever we put after this key word ends up being what is returned (!) by the function.

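*Arguments are local to the function: they exist only while the function runs and are not defined outside of its scope. The same is true of variables created inside the function, such as new_age. You can convince yourself of this by trying to use one of them in a standalone context.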

In [None]:
# demonstrating the point about function scope

print(new_age)

You can see above that names defined inside the function do not exist outside of it: Python raises a NameError. In addition to this, we need to supply values for the function's arguments when we call it; otherwise Python raises a TypeError:

In [None]:
# if we try running the function with no arguments

age_addition()

In [None]:
# now supplying values for both arguments

age_addition(current_age=instructor_age, age_to_add=5)

In the example directly above, we have not assigned the results to a variable. This means that while the result is output, we can't retrieve it at a later point in time. Let's call the function with different age_to_add arguments and store the results.

In [None]:
# defining the results

still_young_instructor = age_addition(current_age=instructor_age, age_to_add=2)
relatively_old_instructor = age_addition(current_age=instructor_age, age_to_add=50)

print(f"The still-young instructor is {still_young_instructor} years old.")
print(f"The relatively old instructor is {relatively_old_instructor} years old.")

This is a very basic introduction to functions - there is much more nuance to understand about how to use and implement these. However, the main point to take away for now is that functions are a way of writing custom instructions which we can reuse many times. The example we have written here does something trivially easy, but many more complicated examples can be imagined.
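As one example of that nuance, arguments can be given default values, and a function can carry a short description (a "docstring"). The sketch below is a hypothetical variant of age_addition; the name `age_addition_with_default` is an assumption made for illustration.

```python
def age_addition_with_default(current_age, age_to_add=1):
    """Return current_age plus age_to_add (which defaults to 1)."""
    new_age = current_age + age_to_add
    return new_age


print(age_addition_with_default(27))      # uses the default age_to_add
print(age_addition_with_default(27, 5))   # positional arguments
print(age_addition_with_default(current_age=27, age_to_add=5))  # keyword arguments

# Calling without the required argument raises a TypeError, which we can catch
try:
    age_addition_with_default()
except TypeError as error:
    print(f"TypeError: {error}")
```

Catching the error with try/except lets the rest of the cell keep running, which can be handy when demonstrating a failure on purpose.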

### Packages

A reasonable amount of functionality is in-built with Python, covering everything we have done above and more. However, one of the great things about Python (and other languages, like R) is that they have an active user base who contribute packages. Packages are add-on pieces of functionality which have been designed to help with particular (or quite general) uses. 

### Pandas

One of the most popular packages that we will certainly be using in our work is pandas, which provides a range of functionality for manipulating tabular data. Let's take a look.

In [None]:
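# installing the pandas package (openpyxl is also needed for pandas to read .xlsx files)
# %pip installs into the environment that this notebook is running in

%pip install pandas openpyxl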

In [None]:
# now importing it into this notebook so we can use it

import pandas as pd

The central concept in pandas is something called a DataFrame, which is basically a data table defined with a set of rows and columns, and a lot of integrated functionality. DataFrames can be created by reading in files (e.g. from Excel), which is what we're going to do to explore some basic functionality here.
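DataFrames can also be built directly in code, which is a quick way to see the row-and-column structure. A minimal sketch with made-up data follows; `example_df` and its contents are hypothetical and do not come from any real dataset.

```python
import pandas as pd

# Made-up example data (an assumption for illustration)
example_df = pd.DataFrame(
    {
        "sector": ["Buildings", "Transport", "Buildings", "Transport"],
        "year": [2025, 2025, 2030, 2030],
        "value": [1.0, 2.0, 3.0, 4.0],
    }
)

print(example_df.shape)            # (number of rows, number of columns)
print(list(example_df.columns))    # the column names
print(example_df["value"].sum())   # a column can be summarised directly
```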

We've got the cb7 full dataset saved in a data subfolder within this repository. Let's take a look at that using pandas.

In [None]:
# we know our file has multiple tabs
# note use of pd. syntax, which indicates that we are using some pandas functionality

file = pd.ExcelFile("../data/cb7_full_dataset.xlsx")
file.sheet_names

In [None]:
# we don't want to look at all of these tabs now - let's just look at the sector-level data
# note that the term df is commonly used as a shorthand for DataFrame

df = pd.read_excel(file, sheet_name="Sector-level data")

In [None]:
# we've now got a DataFrame
# Let's look at the first few rows to confirm it is what we want

df.head(5)

In [None]:
# we can also use some very general pandas methods to describe our dataset
# first the .info method, which gives us an overview

df.info()

In [None]:
# we can use .describe to give us basic stats about the numeric columns

df.describe()

In [None]:
# that looks about right
# we can use some other basic functionality to inspect individual columns
# note use of square brackets to access individual columns

print(f"The data covers the following years: {df['year'].unique()}") # looking at the year-range (single quotes inside the f-string avoid a syntax error on Python versions before 3.12)

In [None]:
# we can also use a loop to look at the unique entries in each column:

for column in df.columns:
    print(column)
    print(df[column].unique())

The above operations give us a pretty good idea of what is in our data. We can now look at doing some basic operations to transform it.

In [None]:
# we can filter the data based on conditions, in the same way as you might in Excel
# let's also save the results to a new dataframe to carry forward

cost_df = df.loc[df["variable"].str.contains("Cost")].copy() # using .loc functionality to filter to only cost data; .copy() gives an independent DataFrame that we can safely add columns to

In [None]:
# let's inspect this new dataframe

cost_df

In [None]:
# We can create new columns based on entries in other columns

cost_df["cost_type"] = pd.NA # defining an empty column
cost_df.loc[cost_df["variable"].str.contains("capital"), "cost_type"] = "Capital" 
cost_df["cost_type"] = cost_df["cost_type"].fillna(value="Operational")

cost_df

In [None]:
# if we wanted a more concise way of doing this, we could use a "lambda" function with the pandas .apply method

cost_df["cost_type_by_lambda"] = cost_df["variable"].apply(lambda x: "Capital" if "capital" in x else "Operational")
cost_df

In [None]:
# we can check that these two methods give us the same thing

cost_df.loc[cost_df["cost_type_by_lambda"] != cost_df["cost_type"]] # looking at where these are different
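Text matching with `.str.contains` is case-sensitive by default, so "Capital" and "capital" are treated as different. If the capitalisation in a column is inconsistent, one option is to pass `case=False`. A minimal sketch with made-up variable names (the `example` DataFrame is hypothetical, not taken from the cb7 dataset):

```python
import pandas as pd

# Made-up variable names (an assumption for illustration)
example = pd.DataFrame(
    {"variable": ["Capital cost", "Operating cost", "capital cost"]}
)

# case=False makes the match ignore upper/lower case
is_capital = example["variable"].str.contains("capital", case=False)

example["cost_type"] = "Operational"               # default label for every row
example.loc[is_capital, "cost_type"] = "Capital"   # overwrite where the match is True

print(example)
```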

In [None]:
# we can also do grouping operations very easily

grouped_costs_df = cost_df[["year", "value"]].groupby(by=["year"]).sum()

grouped_costs_df

In [None]:
# we can sort data to see which years have the highest total costs

grouped_costs_df.sort_values(by="value", ascending=False)
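Sorting by value ranks the years by total cost. To read the totals in chronological order instead, one option is to sort on the index, since grouping by year makes year the index of the result. A minimal sketch with made-up numbers (`example_totals` is hypothetical and does not come from the cb7 data):

```python
import pandas as pd

# Made-up totals, standing in for a grouped result indexed by year
example_totals = pd.DataFrame(
    {"year": [2030, 2025, 2035], "value": [3.0, 1.0, 5.0]}
).set_index("year")

by_value = example_totals.sort_values(by="value", ascending=False)  # largest total first
by_year = example_totals.sort_index()                               # chronological order

print(by_value)
print(by_year)
```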

There's lots more to be added here, but this will do for now.