<div align="center">
    <h1>MDAP Intern Knowledge Sharing</h1>
    <h2>Introduction to Python</h2>
    <p>Dr Emily Fitzgerald<br> Research Data Specialist, Melbourne Data Analytics Platform</p>
</div>

This notebook and associated data are available at [github.com/unimelbmdap/Workshops/tree/main/PythonAndJavascript](https://github.com/unimelbmdap/Workshops/tree/main/PythonAndJavascript). Instructions for forking and cloning a repository are available on GitHub at [docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/fork-a-repo). 

## Getting Set Up

### Checking Python is installed on your computer
This tutorial assumes Python is already installed on your computer, and that you are using Python version 3. Information on downloading and installing Python can be found on the [python.org](https://www.python.org/) site. 

To confirm Python is installed on your computer, and to see which version the default path is using, open a Terminal shell (Mac) or PowerShell (PC) and run:

`python -V` or `python --version`

and it will print the Python version number:

![python_versions.png](attachment:c5845249-aa78-44cf-82a0-be26a803d50c.png)

If your computer is using Python version 2, try `python3 -V` or `python3 --version`. If that returns a Python 3 version, you will want to use that. To do so, replace `python` with `python3` in the next steps. 
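
Once Python is running, the version can also be checked from inside Python itself. This is a minimal sketch using the standard library's `sys.version_info`; the variable names `version` and `is_python3` are only illustrative.

```python
import sys

# sys.version_info describes the interpreter that is running this code
version = sys.version_info
print(f"Running Python {version.major}.{version.minor}.{version.micro}")

# True when the interpreter is a Python 3 release
is_python3 = version.major == 3
print(is_python3)
```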

### Creating a Virtual Environment
Best practice is to undertake your Python programming within a virtual environment. Demonstrated here is the method using the inbuilt [Python venv system](https://docs.python.org/3/library/venv.html), but this is just one of the many options available for doing this. 

I want to create a virtual environment (venv) named `training_venv`, saved in the directory `interns_2024`.


- In a Terminal shell (Mac) or Powershell (PC), navigate to the directory you want to work in:
  
  Mac: `cd Documents/Training/interns_2024`
  
  PC:  `cd Documents\Training\interns_2024`
  
- Run the command to create the virtual environment:
 
  `python -m venv training_venv`

Note - You don't have to navigate to the directory first, but if you don't, you need to include the path to the directory, e.g. 

`python -m venv Documents/Training/interns_2024/training_venv`

**Note PC users**: If you have not configured the PATH and PATHEXT variables, you may need to provide the full path to your Python executable, e.g.  `C:\Users\Name\AppData\Local\Programs\Python\Python310\python -m venv training_venv`

### Activating and Deactivating the Virtual Environment
If you haven't already, navigate to the directory containing the virtual environment. 

Mac: run the command `source training_venv/bin/activate`

PC: run the command `training_venv\Scripts\activate`

Your venv name will now show in brackets at the start of your command line:

![activate_venv.png](attachment:49030057-beca-4899-a31c-f05c916a857d.png)

To deactivate a venv, simply run the command `deactivate`. 

### Installing Packages in your virtual environment

The [Python Package Index (PyPI)](https://pypi.org/) is the official repository of third-party Python software. You can download and install packages from PyPI with the package manager pip. 

For this training, we are going to use the Python library Pandas, so we will need to install it. 

- In a Terminal shell (Mac) or Powershell (PC), ensure you are in the directory you want to work in and your virtual environment has been activated.
- run the command:

  `pip install pandas`


#### Requirements files
If you want to install a specific version of a package, you can add the version number after two equals signs: `pip install pandas==2.2.2`

If you have forked or cloned an existing repository with a number of different required packages, or you are wanting to ensure you are using the same packages as someone else, you can use a `requirements.txt` file. This is simply a text file with the package names and, usually, the package version requirements. 

Rather than manually doing a pip install for each package, you can install them all in one step with the `requirements.txt` file:

- run the command:

  `pip install -r requirements.txt`

While you could manually type out a `requirements.txt` file, you can also use pip to take a snapshot of all the packages installed in your virtual environment. To do so, 

- run the command:
  
   `pip freeze > requirements.txt`


### Jupyter

A Jupyter Notebook is an in-browser interface that enables you to write and run code, and see the output in the notebook. It supports over 40 programming languages, and notebooks can also include cells of Markdown text (like this one!).

Jupyter Lab is a more complex integrated development environment, enabling you to run several notebooks at once, as well as providing easy access to file management, plugins, etc. 

To start running Jupyter:
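- In a Terminal shell (Mac) or PowerShell (PC), navigate to the directory you want to work in and activate your virtual environment. If Jupyter is not yet installed in that environment, install it first with `pip install notebook jupyterlab`.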
- to start a classic Jupyter Notebook run the command:

  `jupyter notebook`

- to start a Jupyter Lab, run the command:

  `jupyter lab`

This will automatically open a link in your browser (as well as providing the URL if needed). 

The directory structure of your working directory will be visible. To create a new notebook, click the blue 'new notebook' button in Jupyter Notebook, or the blue plus button in Jupyter Lab.  

To shut down a Jupyter Lab or a Jupyter Notebook server:

- In the Terminal or Powershell window, press `CONTROL-C`, then type `y` to confirm closure

### Importing Libraries and Modules

Installing a library makes it available in the virtual environment, but to use it in a Python script we also need to import it into that script. 

Imports need to be done before the library is used. Convention is to group all your imports together at the top of the script/first cell of a notebook. 

There are different options when importing libraries and modules:
- you can just simply use the `import` call with the library name, e.g. importing the pre-installed library math:

  `import math`

- you can import with an alias, often done when the library name is long or will be used frequently. There are some commonly used aliases, e.g.:

  `import pandas as pd`

  `import numpy as np`

  `import matplotlib.pyplot as plt`


- if you just want a specific module or class within a library, you can specify that; if you want more than one from the same library you can import them on the same line separated by commas, e.g. importing the `datetime` and `timezone` classes from the preinstalled Python module `datetime`:

   `from datetime import datetime, timezone`

(Note you could just import all of datetime, but then you would need to include the module name when calling a class from it, e.g. `datetime.timezone()` rather than just `timezone()`.)
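
To make the difference between the two import styles concrete, here is a small sketch. The alias `dt` is an assumption chosen for illustration (it also avoids clashing with the `datetime` class imported in the next cell).

```python
import datetime as dt            # import the whole module under an alias
from datetime import timezone    # import one class directly

# With the whole-module import, the class is reached through the module name
utc_via_module = dt.timezone.utc

# With the direct import, the class name can be used on its own
utc_direct = timezone.utc

# Both names refer to the same object
print(utc_via_module is utc_direct)
```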

In [None]:
import math
import pandas as pd
from datetime import datetime, timezone

## Python Basics

Python is a programming language that is designed to be easily read by humans. It is flexible, able to be used for both functional programming and object-oriented programming. It is an interpreted rather than a compiled programming language, so you can run small sections at a time (which works well with tools like Jupyter Notebooks). 



### Printing
To print (as cell output or to the console), use the function `print()`. You can pass several values to one print statement, including variables and function calls, separated by commas. 

In [None]:
print('Hello world!')

In [None]:
print(4+6)

In [None]:
print('There are', 1+1, 'cats in my house')

### Comments

Comments are text that is ignored by the interpreter when running your code. In Python, comments are preceded by a hash `#` symbol:

In [None]:
# This is a comment
4*7 # comments don't have to start at the beginning of a line
# but everything after the comment is started will be included 7*24 # and a second hash doesn't end them

### Variables

Variables are mutable named items that store information, from anything as simple as a value or string, to as complex as a dictionary of dictionaries. It can even be another variable! You don't need to run a command to declare a variable; they are created simply by assigning the value using the equals sign: 

In [None]:
x = 0

name = 'Emily'
print('my name is', name)

name = name.upper() + ' FITZGERALD'
print('my changed name is', name)

x = name
print('now x is', x)

Variable names are case sensitive - `name` and `Name` would be two different variables. 

Be careful when choosing variable names - they can be overwritten with no warning, e.g.

In [None]:
x = 5
print(x)

x = 20

print(x)

Because of this, you want to make sure you don't give your variable a name that has already been used by another variable, or by a function. There is also a list of reserved words (keywords) that Python uses, which cannot be used as variable names:

| | | | | | | |
|-- | -- | -- | -- | -- | -- | -- |
| and | as | assert | async | await | break | class |
| continue | def | del | elif | else | except | False |
| finally | for | from | global | if | import | in |
| is | lambda | None | nonlocal | not | or | pass |
| raise | return | True | try | while | with | yield |

Other rules for variable names are:
- A Python variable name can only contain alpha-numeric characters and underscores (A-Z, a-z, 0-9, and _ ).
- A Python variable name must start with a letter or the underscore character; it cannot start with a number.

Convention in Python is to use snake case `this_is_a_variable_name` rather than camel case `thisIsAnotherVariableName` (except when naming classes).

You can assign multiple variables at the same time, either by listing the values manually, or from a list (this is called unpacking a list):

In [None]:
list_of_names = ['Aiko', 'Chidi', 'Odin', 'Ada']

cat1, cat2, cat3, cat4 = 'Aiko', 'Chidi', 'Odin', 'Ada'
cat5, cat6, cat7, cat8 = list_of_names 

print('cat1: ', cat1)
print('cat2: ', cat2)
print('cat3: ', cat3)
print('cat4: ', cat4)
print('---')
print('cat5: ', cat5)
print('cat6: ', cat6)
print('cat7: ', cat7)
print('cat8: ', cat8)

### Data Types
In Python there are 15 built-in data types within eight categories:

#### Text
- Strings (`str`)
   - create with either single or double quotation marks, or the constructor function `str()`
   - `string1 = 'this is a string with single quotation marks'`
   - `string2 = "this is a string, it's got double quotation marks"`
   - `string3 = str("this is a string with a constructor")`

#### Numeric
- Integers (`int`)
   - create with an integer or the constructor function `int()`
   - `integer1 = 445363567`
   - `integer2 = int(56)`

- Floats (`float`)
   - create with a float (decimal point number) or the constructor function `float()`
   - `float1 = 97.2`
   - `float2 = float(56.34234234)`

- Complex (`complex`)
   - create with a complex number, using 'j' to represent the imaginary part, or the constructor function `complex()`
   - `complex1 = 3j`
   - `complex2 = complex(1j)`

#### Sequence

- List (`list`)
   - create with square brackets or the constructor function `list()`. They are ordered (indexed starting at 0), changeable, and allow duplicate values and different data types. 
   - `list1 = [4, 26.89, 'Amanda', ['list', 'within', 'lists'], 4]`
   - `list2 = list((67, 289.76, "Andrew", False, 67))`

- Tuple (`tuple`)
   - create with round brackets or the constructor function `tuple()`. They are ordered (indexed starting at 0), and allow duplicate values and different data types. However, they are not changeable: items in a tuple cannot be added, removed, or replaced. 
   - `tuple1 = (4, 26.89, 'Amanda', 4, True)`
   - `tuple2 = tuple((67, 289.76, "Andrew", False, 67))`

- Range (`range`)
   - A sequence of numbers created with the constructor function `range()` and either the final number only (if the range starts from 0) or the start and end numbers. The range does not include the final number. 
   - `range1 = range(5)`
   - `range2 = range(3, 78)`

#### Mapping
- Dictionary (`dict`)
   - A collection of key-value pairs created with curly brackets or the constructor function `dict()`. They are changeable and ordered (if using Python 3.7 or higher), but do not allow duplication of keys. 
   - `dictionary1 = {'name': 'Aiko', 'species': 'cat', 'colour': 'black', 'secondary_colour': 'black', 'human': 'Sarah'}`
   - `dictionary2 = dict(name='Chidi', species='cat', colour='white', secondary_colour='black', human='Sarah')`
  
#### Set
- Set (`set`)
   - create with curly braces or the constructor function `set()`. They are unordered, do not allow duplicate values, and the items within them cannot be changed (though items can be added and removed). 
   - `set1 = {'George', 'John',  'Ringo', 'Paul'}`
   - `set2 = set((67, 289.76, "Andrew", False))`

- FrozenSet (`frozenset`)
   - create with the constructor function `frozenset()`; there is no bracket-only syntax, and curly braces inside round brackets just create an ordinary set. They are sets that are unchangeable, with no additions or deletions. 
   - `frozenset1 = frozenset({4, 26.89, 'Amanda', 4, True})`
   - `frozenset2 = frozenset((67, 289.76, "Andrew", False))`

#### Boolean
- Boolean (`bool`)
   - `True` or `False` - note in Python they are written with a capital first letter.  

#### None
- NoneType
   - N/A value formatted as `None`. 

#### Binary
- Bytes (`bytes`), byte array (`bytearray`), and memory view (`memoryview`)

In [None]:
string1 = 'this is a string with single quotation marks'
string2 = "this is a string, it's got double quotation marks"
string3 = str("this is a string with a constructor")
integer1 = 445363567
integer2 = int(56)
float1 = 97.2
float2 = float(56.34234234)
complex1 = 3j
complex2 = complex(1j)
list1 = [4, 26.89, 'Amanda', ['list', 'within', 'lists'], 4]
list2 = list((67, 289.76, "Andrew", ['list', 'within', 'lists'], 67))
tuple1 = (4, 26.89, 'Amanda', 4, True)
tuple2 = tuple((67, 289.76, "Andrew", False, 67))
range1 = range(5)
range2 = range(3, 78)
dictionary1 = {'name': 'Aiko', 'species': 'cat', 'colour': 'black', 'secondary_colour': 'black', 'human': 'Sarah'}
dictionary2 = dict(name='Chidi', species='cat', colour='white', secondary_colour='black', human='Sarah')
set1 = {'George', 'John',  'Ringo', 'Paul'}
set2 = set((67, 289.76, "Andrew", False))
frozenset1 = frozenset({4, 26.89, 'Amanda', True})
frozenset2 = frozenset((67, 289.76, "Andrew", False))
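
If you are ever unsure which data type a value has, the built-in `type()` function will tell you. The sketch below uses illustrative values (not the variables defined above) and also shows that a range stops before its final number.

```python
# One illustrative value for most of the built-in data types
examples = ['text', 42, 97.2, 3j, [1, 2], (1, 2), range(5),
            {'a': 1}, {1, 2}, frozenset({1, 2}), True, None, b'raw']

# type(value).__name__ gives the name of each value's data type
type_names = [type(value).__name__ for value in examples]
print(type_names)

# Converting a range to a list shows that the end number is excluded
range_values = list(range(3, 6))
print(range_values)  # [3, 4, 5]
```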

### Accessing Ordered Data
In ordered data types such as lists and tuples, you can use the index to access individual items by including the index number or range of numbers in square brackets:

In [None]:
list1[0]

In [None]:
tuple2[1:4]

If the list contains lists, these accessors can also be nested:

In [None]:
print('list2[3]:', list2[3])
print('list2[3][1]:',list2[3][1])
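
A few more indexing details are sketched below with an illustrative list called `pets`: negative indexes count from the end, a slice stops before its end index, and lists can be changed in place while tuples cannot.

```python
pets = ['Aiko', 'Chidi', 'Odin', 'Ada']

first = pets[0]      # indexes start at 0
last = pets[-1]      # negative indexes count back from the end
middle = pets[1:3]   # a slice includes index 1 and 2, but not 3
print(first, last, middle)

# Lists are changeable, so an item can be replaced by index
pets[0] = 'AIKO'
print(pets)

# Tuples are not changeable, so the same operation raises a TypeError
fixed_pets = ('Aiko', 'Chidi')
try:
    fixed_pets[0] = 'AIKO'
except TypeError as error:
    tuple_error = str(error)
    print('Cannot change a tuple:', tuple_error)
```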

### Accessing Named Data
With dictionaries, you can extract values by using their key as an accessor. 

In [None]:
dictionary1['name']

In [None]:
dictionary2['species']
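
Keys can also be added or updated with the same square-bracket syntax, and the dictionary method `.get()` is a safe way to look up a key that might not exist. A small sketch, using an illustrative dictionary called `cat_details`:

```python
cat_details = {'name': 'Aiko', 'species': 'cat'}

# Assigning to a new key adds it; assigning to an existing key overwrites it
cat_details['colour'] = 'black'
cat_details['name'] = 'Aiko the First'

# .get() returns a fallback value instead of raising a KeyError
age = cat_details.get('age', 'unknown')

print(cat_details)
print(age)  # unknown
```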

#### f-strings
F-strings enable you to include the values from variables inside a string. To do this, put an `f` immediately before the opening quotation mark. Variables, expressions, function calls, etc. are placed within curly braces in the string:

In [None]:
print(f"There are about {x} people in MDAP, and the interns joining us bring that number up to about {x + 8}!")

In [None]:
print(f"There is a black cat in my house named {cat1}. When we call her, we need to yell {cat1.upper() + '! Treats'.upper()}!!!")

In [None]:
print(f"After me, {list1[2]} will speak about Javascript")
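
F-strings can also format numbers, by adding a colon and a format specifier after the value inside the curly braces. The numbers below are made up purely for illustration.

```python
visitors = 5207145   # hypothetical count
share = 0.34567      # hypothetical proportion

with_commas = f"{visitors:,}"    # thousands separators
two_decimals = f"{share:.2f}"    # fixed number of decimal places
as_percent = f"{share:.1%}"      # multiply by 100 and add a % sign

print(with_commas)   # 5,207,145
print(two_decimals)  # 0.35
print(as_percent)    # 34.6%
```

The `:,` specifier is the one used in the function examples later in this notebook.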

### If Else
Python uses standard math formatting for logic conditions, e.g.

- Less than: a < b
- Less than or equal to: a <= b
- Greater than: a > b
- Greater than or equal to: a >= b

Equality is checked using a double equals sign `==`, while not equals is `!=`. 

#### If Statement
Python has the ability to have a conditional `if` command. It is structured as:
- First line begins with `if`, followed by the condition, ending in a colon
- following lines indented, with the action to be performed if the condition is met

In [None]:
if 'single' in string1:
    print('We found the single quotes!')

#### If/Else Statement
This allows you to have an action based on a condition, and also specify what action to take if the condition is not met. It is structured as:
- First line begins with `if`, followed by the condition, ending in a colon
- following lines indented, with the action to be performed if the condition is met
- Then a line that is `else:`
- following lines indented, with the action to be performed if the condition is not met

In [None]:
if len(string1) >= len(string2):
    print('string 1 is longer than or equal to string 2')
else:
    print('string 1 is shorter than string 2')

#### If/Else If/Else Statement
Python also enables you to include a number of different condition options, and also specify the action to take if none of the conditions are met. It is structured as:
- First line begins with `if`, followed by the condition, ending in a colon
- following lines indented, with the action to be performed if the condition is met
- Then beginning with `elif`, followed by the next condition, ending in a colon
- following lines indented, with the action to be performed if the new condition is met
- repeat as many elif statements as required
- Then a line that is `else:`
- following lines indented, with the action to be performed if the condition is not met

Note that conditions are checked in order: as soon as one is met, its action is performed and the rest of the if statement is skipped. 

In [None]:
time = 8.30
cat2_sleeping = True

if time <=8 and cat2_sleeping:
    print(f"{cat1} to loom over me while {cat2} and I sleep")
elif time <=8 and not cat2_sleeping:
    print('both cats are awake!')
    print('Ignore their demands for breakfast')
elif time >8 and cat2_sleeping:
    print(f"{cat1} is getting {cat2} awake now!")
else:
    print('better feed the cats!')
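
Because conditions are checked from top to bottom, the order you write them in matters. In this sketch (the function name `describe_hour` and its cut-off hours are assumptions for illustration), an hour of 9 satisfies both `hour < 12` and `hour < 18`, but only the first matching branch runs.

```python
def describe_hour(hour):
    # Branches are tested in order and only the first match is used
    if hour < 12:
        return 'morning'
    elif hour < 18:
        return 'afternoon'
    else:
        return 'evening'

print(describe_hour(9))   # morning
print(describe_hour(15))  # afternoon
print(describe_hour(21))  # evening
```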


### Indentation
One aspect of the syntax that is designed to help with readability is the importance of indentations and whitespace. Commands are completed by going to a new line (rather than needing a semicolon or bracket at the end) and indentation is used to define the scope of things such as loops, functions, and classes. 

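Error messages are generally helpful. For example, running the cell below raises an `IndentationError`, because the lines that belong to the `while` loop have not been indented: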

In [None]:
count = 0

while count <=100:
print(count)
count += 10
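
For comparison, here is a sketch of the same loop with the indentation fixed. The variable names `tens` and `printed_values` are illustrative; the list is only there so the result can be inspected afterwards.

```python
tens = 0
printed_values = []

while tens <= 100:
    # Both indented lines belong to the loop and run on every pass
    print(tens)
    printed_values.append(tens)
    tens += 10

# This line is not indented, so it runs once, after the loop has finished
print('Finished at', tens)
```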

### Loops
Python can do both for loops (for every item in this list/range/etc, do this) and while loops (while this is happening, do that until it stops). 

#### For Loops
For loops are structured as:
- a first line of `for` item `in` the thing being iterated through, ending in a colon
- following lines indented, with the action to be undertaken for each item


In [None]:
for name in list_of_names:
    print(name)
    print(f"- aka, {name.lower()}")

For loops can include if statements:

In [None]:
vowels = ['a', 'e', 'i', 'o', 'u']
for letter in 'python':
    if letter in vowels:
        print(letter, ': vowel')
    else:
        print(letter, ': consonant')

You can also include `break` and `continue` in actions: 

In [None]:
for i in range(100):
    if i < 75:
        continue
    else:
        print(i)
        break

#### While Loops
While loops continue with an action as long as a condition is true. You need to make sure that you set it up so that the condition will become false, or you will be stuck in an infinite loop. 

The `+=` assignment can be useful with while loops - it takes the value of a variable and adds the amount specified, e.g.

In [None]:
x = 10
print(x)

x+= 7
print('changed to:', x)

In [None]:
count = 0
while count < 10:
    print(f"{10-count}...")
    count +=1
print('LIFTOFF!!!')

### Functions
While there are many built-in functions, you can create your own functions in Python. A function is structured as:
- `def` function_name(parameters):
- indented block with processing that the function is doing
- if the function is to return an output, indent `return` output

So if I wanted to create a function called simple_multiplyer that multiplied a value by 2, and also printed a statement saying what it had done, it would look like:

In [None]:
def simple_multiplyer(val):
    result = val * 2 # note this is a local variable, can't be used outside of the function
    print(f"Multiplying {val:,} by 2 equals {result:,}")
    return result

In [None]:
result  # raises a NameError: result only exists inside the function

You call the function by giving the function name with the arguments in brackets at the end. You can also assign the results of a function to a variable:

In [None]:
simple_multiplyer(8)

In [None]:
multiplier_result = simple_multiplyer(45456)

In [None]:
multiplier_result

You can also include default values in your parameters. Parameters without a default value must be supplied as arguments when the function is called; those with a default are optional. If you don't know in advance how many arguments will be passed, you can use `*args` and `**kwargs`. 

In [None]:
def multiplyer(val, multiplier_val=3):
    result = val * multiplier_val 
    print(f"Multiplying {val:,} by {multiplier_val} equals {result:,}")
    return result

In [None]:
using_default = multiplyer(5)

In [None]:
using_own_multi = multiplyer(val=67, multiplier_val=234)

In [None]:
def multi_multiplyer(multiplier_val, *args):
    for arg_val in args: 
        print(f"Multiplying {arg_val:,} by {multiplier_val} equals {arg_val*multiplier_val:,}")

In [None]:
multi_multiplyer(5, 4, 7, 8, 343, 14)

In [None]:
def name_game(*args):
    for name in args:
        drop_first = name[1:]
        print(f"{name}, {name}, bo-b{drop_first}")
        print(f"Bonana-fanna fo-f{drop_first}")
        print(f"Fee fi mo-m{drop_first}")
        print(f"{name.upper()}!\n")

In [None]:
name_game('Fred', 'Katie', 'Emily')
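
The examples above use `*args`, which collects extra positional arguments into a tuple. `**kwargs` works the same way for named arguments, collecting them into a dictionary. A minimal sketch, with a hypothetical function `describe_cat`:

```python
def describe_cat(cat_name, **kwargs):
    # kwargs is a dictionary of every named argument that was passed in
    details = [f"{key}: {value}" for key, value in kwargs.items()]
    return f"{cat_name} ({', '.join(details)})"

summary = describe_cat('Aiko', colour='black', human='Sarah')
print(summary)  # Aiko (colour: black, human: Sarah)
```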

## Dataframes with Pandas

There are other libraries available for working with dataframes in Python, but the most common one is Pandas. The Pandas [documentation](https://pandas.pydata.org/pandas-docs/stable/index.html) includes an [API reference](https://pandas.pydata.org/pandas-docs/stable/reference/index.html#api) containing details of every object, function, and method. The documentation also has good search engine optimisation, so if you are searching for a Pandas method in a search engine, it will usually be a high result. 

We can create a dataframe from scratch using a dictionary, e.g.

In [None]:
example_dict = {
    'names': ['Aiko', 'Chidi', 'Odin'],
    'colours': ['black', 'black and white', 'black and white'],
    'sexes': ['female', 'male', 'male']
}

In [None]:
cat_df = pd.DataFrame(example_dict)
cat_df

Most frequently you will be reading in a file to turn into a dataframe. You can read in a wide range of file formats, but the most common functions would be `pd.read_csv()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html#pandas.read_csv)), `pd.read_excel()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel)), and `pd.read_json()` ([documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_json.html#pandas.read_json)). 

For this workbook, we are using data from [OpenStreetMap](https://www.openstreetmap.org/#map=5/-28.15/133.28) about Greater Melbourne. 

This is a very large dataset, so it has been split into three csv files for sharing. 

Standard convention is to use df as a dataframe name. We will import each with a df_number as the variable name, and call the combined dataframe df. 

First import them using `pd.read_csv()`. I have included the `low_memory=False` as there are columns with mixed data types that give a warning and this avoids that; if it is running slowly, you can remove that argument. 

In [None]:
df_1 = pd.read_csv('GreaterMelbourneOSM_pt_1.csv', low_memory=False)
df_2 = pd.read_csv('GreaterMelbourneOSM_pt_2.csv', low_memory=False)
df_3 = pd.read_csv('GreaterMelbourneOSM_pt_3.csv')

To join the three dataframes into one, we will use `pd.concat`, together with `reset_index()`. Each dataframe has an index going from 0 to approximately 490,033, so joining them together means there are repeated numbers in the index; resetting resolves that. 

In [None]:
df = pd.concat([df_1, df_2, df_3]).reset_index(drop=True)
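
To see what `reset_index(drop=True)` is doing, here is a sketch with two tiny made-up dataframes (`part_1` and `part_2` are illustrative names, not part of the workshop data).

```python
import pandas as pd

part_1 = pd.DataFrame({'name': ['Aiko', 'Chidi']})
part_2 = pd.DataFrame({'name': ['Odin', 'Ada']})

# Each part keeps its own index, so the stacked result has repeats
stacked = pd.concat([part_1, part_2])
print(list(stacked.index))   # [0, 1, 0, 1]

# reset_index(drop=True) renumbers the rows and discards the old index
combined = stacked.reset_index(drop=True)
print(list(combined.index))  # [0, 1, 2, 3]
```

Without `drop=True`, the old index would be kept as an extra column called `index`.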

With large dataframes, Pandas will default to only showing the first 5 and last 5 rows of the dataframe, and the first ten and last ten of the columns. 

In [None]:
df

You can change these limits, or change the overall width of the display in characters, with:

- `pd.set_option('display.max_rows', 500)`
- `pd.set_option('display.max_columns', 500)`
- `pd.set_option('display.width', 1000)`

I recommend just using `display.max_columns`, set to show all columns:

In [None]:
pd.set_option('display.max_columns', None)

In [None]:
df

### Summarising the data

You can see a summary of the column names with the attribute `columns`

In [None]:
df.columns

You can get an overall picture of your data with the method `info`

In [None]:
df.info()

Each column in the dataframe is called a series. You can get a summary of the values in a particular series with the method `value_counts`

In [None]:
df['osm_type'].value_counts()

You can see the overall size of the dataset with `.shape`

In [None]:
df.shape
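
The same summaries can be tried on a small made-up dataframe, where the results are easy to check by eye. The name `toy_df` and its contents are assumptions for illustration only.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'osm_type': ['node', 'way', 'node'],
    'name': ['Park', 'Path', 'Pond'],
})

# .shape is a (rows, columns) tuple
print(toy_df.shape)  # (3, 2)

# .columns holds the column names
column_names = list(toy_df.columns)
print(column_names)

# value_counts() counts how often each value appears in a series
type_counts = toy_df['osm_type'].value_counts()
print(type_counts)
```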

### N/A values
We have a lot of N/A values in this dataset. If you have a row or a column that is nothing but N/A's, you can use `.dropna` - e.g.

`df.dropna(how='all', axis=1)`

would drop any column (`axis=1`) where every value (`how='all'`) is N/A. 

In our dataset we can see from `info` that we don't have any columns that are all N/A's, but we have a few with only a small number of actual values. I am going to drop every column with fewer than 100 non-N/A values, using the `thresh` argument (the minimum number of non-N/A values a column needs in order to be kept). 

First I'm going to get a list of the existing columns for comparison's sake:

In [None]:
original_cols = df.columns
df = df.dropna(axis=1, thresh=100)

In [None]:
df.head()
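
The difference between `how='all'` and `thresh` is easier to see on a small example. This sketch uses a made-up dataframe (`toy_df`) with one full column, one mostly empty column, and one completely empty column.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'full': [1, 2, 3, 4],
    'sparse': [1, None, None, None],
    'empty': [None, None, None, None],
})

# how='all' only drops columns where every value is N/A
dropped_all_na = toy_df.dropna(axis=1, how='all')
print(list(dropped_all_na.columns))  # ['full', 'sparse']

# thresh=2 keeps only columns with at least 2 non-N/A values
kept_by_thresh = toy_df.dropna(axis=1, thresh=2)
print(list(kept_by_thresh.columns))  # ['full']
```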

I'm going to use a list comprehension to see what columns were removed:

In [None]:
removed = [col for col in original_cols if col not in df.columns]
removed
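
A list comprehension is a compact way of writing a for loop that builds a list. The sketch below, using made-up column names, shows the loop version and the comprehension version side by side; they produce the same result.

```python
columns_before = ['name', 'colour', 'age', 'weight']
columns_after = ['name', 'colour']

# The long way: start with an empty list and append inside a for loop
removed_with_loop = []
for col in columns_before:
    if col not in columns_after:
        removed_with_loop.append(col)

# The same logic as a list comprehension, written on one line
removed_with_comprehension = [col for col in columns_before if col not in columns_after]

print(removed_with_loop)           # ['age', 'weight']
print(removed_with_comprehension)  # ['age', 'weight']
```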

We can also fill N/A's with other values - most commonly, an empty string or a 0 - with `.fillna()`. 

In [None]:
df = df.fillna('')

### iloc
If we want to look at a particular row more closely, we can use `.iloc`, which uses the row's integer position to pull up the row details. 

In [None]:
df.iloc[4324]

We can also pull out a particular column value, by calling the column name:

In [None]:
df.iloc[4324]['addr_street']
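
In our dataframe the index was reset, so a row's position and its index label are the same number. When they differ, `.iloc` goes by position while `.loc` (covered in the next section) goes by label. A sketch with a made-up dataframe whose index labels are 10, 20, 30, and 40:

```python
import pandas as pd

toy_df = pd.DataFrame({'name': ['Aiko', 'Chidi', 'Odin', 'Ada']},
                      index=[10, 20, 30, 40])

# .iloc uses integer position: position 1 is the second row
by_position = toy_df.iloc[1]['name']

# .loc uses the index label: label 20 is also the second row here
by_label = toy_df.loc[20, 'name']
print(by_position, by_label)  # Chidi Chidi

# Position slices stop before the end; label slices include it
position_slice = toy_df.iloc[0:2]
label_slice = toy_df.loc[10:30]
print(len(position_slice), len(label_slice))  # 2 3
```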

### loc

Similar to iloc, we can use `.loc` to bring up a row, or a collection of rows. Using `.loc` we can also make changes to the rows in the dataframe. 

We can use the index with `.loc` to bring up a row or selection of rows:

In [None]:
df.loc[4324:4326]

We can also use `.loc` to find rows that meet certain criteria. To do this, the format is:
    
`df.loc[df['column_name'] criteria]`

In [None]:
df.loc[df['historic'] == 'archaeological_site']

You can have multiple conditions as part of the filter, but each one needs to be in a set of round brackets, and you need to use the `&` or `|` operators, rather than `and` and `or`, e.g.

In [None]:
print(df.loc[(df['tourism'] == 'picnic_site')].shape)

In [None]:
print(df.loc[(df['tourism'] == 'picnic_site') & (df['amenity'] == 'shelter')].shape)

In [None]:
df.loc[(df['tourism'] == 'picnic_site') & (df['amenity'] == 'shelter')] 
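
Here is the same kind of filter on a small made-up dataframe (`toy_df`), showing the difference between `&` (both conditions must be true) and `|` (either condition can be true).

```python
import pandas as pd

toy_df = pd.DataFrame({
    'tourism': ['picnic_site', 'picnic_site', 'museum', ''],
    'amenity': ['shelter', '', '', 'shelter'],
})

is_picnic_site = toy_df['tourism'] == 'picnic_site'
is_shelter = toy_df['amenity'] == 'shelter'

# & keeps rows where both conditions are true
both = toy_df.loc[is_picnic_site & is_shelter]
print(both.shape)    # (1, 2)

# | keeps rows where at least one condition is true
either = toy_df.loc[is_picnic_site | is_shelter]
print(either.shape)  # (3, 2)
```

Saving each condition to its own variable first, as done here, is optional, but it can make long filters easier to read.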

We can use `.loc` to select the rows we want to make changes to. For example, postcodes are showing as floats, when they should be integers. If we try and change the data type for the whole column, we get an error, because of the empty values - neither empty strings nor N/A's can be integers.  

In [None]:
# df['addr_postcode'].astype(int)

However, we can use a `.loc` statement to identify all the rows that are blank and set them to the integer 0, then change the dtype for the column:

In [None]:
df.loc[df['addr_postcode'] == '', 'addr_postcode'] = 0

There was also a string in the mix, we could identify that and change to a specific value

In [None]:
df.loc[df['addr_postcode'] == '3006;3130', 'addr_postcode'] = 3006

And then change all the rows where the value is not 0 to be integers. 
We could use `df['addr_postcode'].astype(int)` on the whole column here, but we want to be able to turn the placeholder 0 rows back into empty strings afterwards. 

In [None]:
df.loc[df['addr_postcode'] != 0, 'addr_postcode'] = df['addr_postcode'].astype(int)

In [None]:
df.loc[df['addr_postcode'] == 0, 'addr_postcode'] = ''

In [None]:
df
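
As an aside, Pandas also offers a different route to the same goal, which is not the approach used in this notebook: `pd.to_numeric` with `errors='coerce'` turns anything that cannot be read as a number into N/A, and the nullable `Int64` dtype lets a column hold integers and N/A values together. A sketch on a made-up series:

```python
import pandas as pd

# Illustrative values: a float, an empty string, a numeric string, and text
raw_postcodes = pd.Series([3000.0, '', '3053', 'unknown'])

# Values that cannot be parsed as numbers become N/A instead of raising an error
as_numbers = pd.to_numeric(raw_postcodes, errors='coerce')

# 'Int64' (capital I) is the nullable integer dtype, which allows N/A values
postcodes = as_numbers.astype('Int64')
print(postcodes)
```

The trade-off is that the blank rows end up as N/A rather than empty strings, so this only suits cases where that is acceptable.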

An important thing to note when using `.loc` statements is that you are working with the whole series for each column. So if you were to try and use ordinary string methods, they would either be applied to the whole series and that would be returned for each row, or you would get an error message. See, for example:

In [None]:
df.loc[:, 'type'] = df['type'].lower()  # raises an AttributeError: a Series has no .lower() method

To resolve this, we add in the `.str` accessor, which applies the string method to each value in the series:

In [None]:
df.loc[:, 'type'] = df['type'].str.lower()

In [None]:
df.head(2)

This also applies when using `.loc` to find particular rows:

In [None]:
df.loc[df['osm_type'].str.upper() == 'RELATIONS']

In [None]:
df.loc[df['top_level_tag'].str.contains('water', case=False)]
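
The `.str` accessor can be tried out on a small made-up series (`tag_values`) to see what each call returns.

```python
import pandas as pd

tag_values = pd.Series(['Park', 'WATER', 'Waterway'])

# .str.lower() returns a new series with every value lower-cased
lowered = tag_values.str.lower()
print(list(lowered))  # ['park', 'water', 'waterway']

# .str.contains() returns True/False for each value, ready to use as a filter
has_water = tag_values.str.contains('water', case=False)
print(list(has_water))  # [False, True, True]

# Using the True/False series inside .loc keeps only the matching values
water_only = tag_values.loc[has_water]
print(list(water_only))  # ['WATER', 'Waterway']
```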

### Creating new columns

We can create new columns, either with new data, or by getting data from the dataframe. 

If we simply create a new column by naming it, it will go at the end of the dataframe, e.g.

In [None]:
df['location'] = 'Greater Melbourne'

In [None]:
df.head(5)

Or we can use `.insert` to create the column in a specific location. For this, the parameters are column position (counting from 0), column title, column contents, e.g.

In [None]:
df.insert(8, 'address', df['addr_housenumber'] + ' ' + df['addr_street'] + ' ' + df['addr_postcode'].astype(str))

In [None]:
df.head(2)

We can then get rid of the individual address columns with `.drop()`:

In [None]:
df = df.drop(columns=['addr_housenumber', 'addr_postcode', 'addr_street'])

In [None]:
df.head(2)
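
The insert-then-drop pattern can be followed step by step on a one-row made-up dataframe. The street address below is invented for illustration.

```python
import pandas as pd

toy_df = pd.DataFrame({
    'name': ['Aiko'],
    'addr_housenumber': ['12'],
    'addr_street': ['Example Street'],
})

# Insert the combined column at position 1, directly after 'name'
toy_df.insert(1, 'address', toy_df['addr_housenumber'] + ' ' + toy_df['addr_street'])

# .insert changes the dataframe in place; .drop returns a new one
toy_df = toy_df.drop(columns=['addr_housenumber', 'addr_street'])

print(list(toy_df.columns))       # ['name', 'address']
print(toy_df.loc[0, 'address'])   # 12 Example Street
```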

### Apply

We can use `.apply` to apply a function to every value in a column of the dataframe. 

First we are going to create a function, and then apply it to the `population` column.

In [None]:
df['population'].value_counts()

In [None]:
def expand_population(val):
    if val != '':
        new_val = f"Contains {int(val):,} people"
    else:
        new_val = 'No population specified'

    return new_val

In [None]:
df['population'] = df['population'].apply(expand_population) 

In [None]:
df['population'].value_counts()

In [None]:
df.tail(5)
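
To see exactly what `.apply` does, here is the same idea on a three-value made-up series. The function `label_population` is a simplified, hypothetical stand-in for the one above.

```python
import pandas as pd

population_values = pd.Series(['1200', '', '35'])

def label_population(val):
    # Called once for each value in the series
    if val != '':
        return f"{int(val):,} people"
    return 'unknown'

# Note the function is passed by name, without brackets
population_labels = population_values.apply(label_population)
print(list(population_labels))  # ['1,200 people', 'unknown', '35 people']
```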

### Map

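We can also map new values onto those in existing columns. If we have a dictionary with keys matching values in a Pandas series, we can use `.map` to look up the new value for each row. Values with no matching key (including our empty strings) become N/A in the new column. 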

First we create a dictionary:

In [None]:
# df.loc[df['public_transport'] != '']

In [None]:
pt = {
    "platform" : "Railway Platform", 
    "stop_position" : "Tram/Bus Stop", 
    "station" : "Railway Station", 
    "service_center" : "No Public Access", 
}

In [None]:
df['public_transport_alternate'] = df['public_transport'].map(pt)

In [None]:
df.loc[df['public_transport'] != '']
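
The sketch below shows `.map` on a made-up series, including what happens to a value that has no matching key in the dictionary, and one way to tidy that up afterwards with `.fillna`.

```python
import pandas as pd

stop_types = pd.Series(['platform', 'station', ''])
stop_lookup = {'platform': 'Railway Platform', 'station': 'Railway Station'}

# Each value is looked up in the dictionary; '' has no key, so it becomes N/A
stop_labels = stop_types.map(stop_lookup)
print(stop_labels.isna().tolist())  # [False, False, True]

# The N/A values can then be replaced, here with an empty string
stop_labels_filled = stop_labels.fillna('')
print(list(stop_labels_filled))  # ['Railway Platform', 'Railway Station', '']
```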

### Exporting the dataframe

If you want to convert your dataframe into a shareable format such as a csv file, there are inbuilt methods to do this such as `.to_csv` and `.to_excel`. 

To use `.to_csv` you need to provide the name of the csv file. You can include a whole filepath if you are wanting to save to a different location than your working directory. 

I recommend also using the argument `index=False` unless you have a named index; otherwise the index will be written as the first column of your csv file, and will appear with the column name 'Unnamed: 0' when the file is read back in.

In [None]:
df.to_csv('GreaterMelbourneOSM_updated.csv', index=False)
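
The effect of `index=False` can be checked without creating any files, by writing to an in-memory text buffer (`io.StringIO` from the standard library) and reading it straight back. The dataframe here is made up for illustration.

```python
import io
import pandas as pd

toy_df = pd.DataFrame({'name': ['Aiko', 'Chidi'],
                       'colour': ['black', 'black and white']})

# Default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy_df.to_csv(with_index)
with_index.seek(0)  # go back to the start of the buffer before reading
columns_with_index = list(pd.read_csv(with_index).columns)
print(columns_with_index)     # ['Unnamed: 0', 'name', 'colour']

# With index=False, only the real columns are written
without_index = io.StringIO()
toy_df.to_csv(without_index, index=False)
without_index.seek(0)
columns_without_index = list(pd.read_csv(without_index).columns)
print(columns_without_index)  # ['name', 'colour']
```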