# Data science in Python

- Course GitHub repo: https://github.com/pycam/python-data-science
- Python website: https://www.python.org/ 

## Session 2.1: Working with Pandas

- [Reading CSV Data Using Pandas](#Reading-CSV-Data-Using-Pandas)
- [Exploring our data](#Exploring-our-data)
- [Exercise 2.1.1](#Exercise-2.1.1)

## Mind map

<img src="img/mind_maps/mind_maps.004.jpeg">

## Reading CSV Data Using Pandas

### Import the Pandas library

Pandas is a widely-used external Python library for statistics, particularly on tabular data.
It borrows many features from R’s dataframes.
A dataframe is a 2-dimensional table whose columns have names and can have different data types.

The Pandas website is at http://pandas.pydata.org/ and its documentation is at http://pandas.pydata.org/pandas-docs/stable/.

Pandas is an external third-party library, so it is not included by default when you install Python. Before you can load `pandas` into your environment, you first need to install it using `pip install pandas`.

When installed, to load it, use `import pandas`:

In [None]:
import pandas

### Read CSV data

To read a Comma-Separated Values (CSV) data file with pandas, we use `pandas.read_csv()`:

- Argument is the name of the file to be read.
- Assign result to a variable to store the data that was read.


The columns in a dataframe are the observed variables, and the rows are the observations. We are going to load a slightly different Gapminder dataset for Oceania, where each column represents the GDP per capita in a different year and each row represents a country in Oceania.

Pandas uses backslash `\` to show wrapped lines when output is too wide to fit the screen.

In [None]:
data = pandas.read_csv('data/gapminder_gdp_oceania.csv')
print(data)
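If you do not have the course data file to hand, the same call can be tried on text held in memory. The following is a minimal sketch, assuming made-up country names and values (they are not Gapminder figures) and a hypothetical variable name `toy`:

```python
import io
import pandas

# Made-up values for illustration only; these are not Gapminder figures.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Atlantis,100.0,110.0
Lemuria,200.0,220.0
"""

# read_csv() accepts a file path or a file-like object;
# io.StringIO wraps the text so it can be read like a file.
toy = pandas.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.shape)  # (number of rows, number of columns)
```

The first line of the text becomes the column names, and each following line becomes one row.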

Our course stores its data files in a `data/` sub-directory, which is why the path to the file is `data/gapminder_gdp_oceania.csv`. If you forget to include `data/`, or if you include it but your copy of the file is somewhere else, you will get a runtime error that ends with a line like this:
```
FileNotFoundError: File b'gapminder_gdp_oceania.csv' does not exist
```
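One way to get a friendlier message is to check that the file exists before reading it. This is only a sketch: `read_gdp` is a hypothetical helper name, not part of pandas, and the path used here is deliberately wrong.

```python
from pathlib import Path

import pandas


def read_gdp(path):
    """Return a DataFrame, or None with a hint if the file is missing.

    Hypothetical helper for illustration; not part of pandas.
    """
    path = Path(path)
    if not path.exists():
        # Show where Python is looking, which is the usual source of confusion.
        print(f"Cannot find {path} (current directory: {Path.cwd()})")
        return None
    return pandas.read_csv(path)


# Deliberately wrong directory name, so the check fails.
result = read_gdp('no_such_directory/gapminder_gdp_oceania.csv')
```

Printing the current working directory usually makes it clear whether the `data/` prefix is missing or the notebook was started from a different folder.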

## Exploring our data

A DataFrame is a 2-dimensional data structure that can store data of different types (including strings, integers, floating point values, categorical data and more) in columns. It is similar to a spreadsheet, an SQL table or the `data.frame` in R. A DataFrame always has an index, which labels its rows. By default, the index is a 0-based sequence of integers giving the position of each row.

In [None]:
data.info()

As expected, it’s a `DataFrame` (or, to use the full name that Python uses to refer to it internally, a `pandas.core.frame.DataFrame`).

It has 2 rows and 13 columns. It uses 288 bytes of memory.

The row headings are numbers (0 and 1 in this case) but we really want to index this DataFrame by country. To do so, we pass the name of the column to `read_csv()` as its `index_col` parameter:

In [None]:
data = pandas.read_csv('data/gapminder_gdp_oceania.csv', index_col='country')
print(data)
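The effect of `index_col` can be seen side by side on a small in-memory table. This is a minimal sketch using made-up names and values rather than the Gapminder file:

```python
import io
import pandas

# Made-up values for illustration only.
csv_text = """country,gdpPercap_1952,gdpPercap_1957
Atlantis,100.0,110.0
Lemuria,200.0,220.0
"""

# Without index_col, rows are labelled 0, 1, ... and 'country' is a column.
default_index = pandas.read_csv(io.StringIO(csv_text))

# With index_col, 'country' becomes the row labels and is no longer a column.
by_country = pandas.read_csv(io.StringIO(csv_text), index_col='country')

print(list(default_index.index))  # [0, 1]
print(list(by_country.index))     # ['Atlantis', 'Lemuria']
print(by_country.shape)           # (2, 2)
```

Because the index column is no longer counted among the data columns, the column count drops by one.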

In [None]:
data.info()

We can also use the `type()` function to see what kind of thing `data` is:

In [None]:
type(data)

To see what kind of things `data` contains, DataFrames have an attribute called `dtypes`, which returns the data type of each column. Note that this is an attribute of the DataFrame `data`, not a method, so do not use `()` to call it.

In [None]:
data.dtypes
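To see `dtypes` report more than one type, here is a small sketch with a hand-built DataFrame. The column names and values are made up for illustration, and the exact name shown for the text column can vary between pandas versions:

```python
import pandas

# Made-up table with one text, one integer and one floating point column.
toy = pandas.DataFrame({
    'country': ['Atlantis', 'Lemuria'],
    'population': [1000, 2000],
    'gdpPercap_1952': [100.0, 200.0],
})

# dtypes is an attribute: no parentheses.
print(toy.dtypes)

# Calling it like a method fails, because the result is a Series, not a function.
try:
    toy.dtypes()
except TypeError as err:
    print('TypeError:', err)
```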

The DataFrame now has 2 rows, named 'Australia' and 'New Zealand', and 12 columns, each of which holds two non-null 64-bit floating point values. It uses 208 bytes of memory.

There are many ways to summarize and access the data stored in DataFrames, using attributes and methods provided by the [DataFrame object](https://pandas.pydata.org/pandas-docs/stable/api.html#dataframe).

To access an attribute, use the DataFrame object name followed by the attribute name `df_object.attribute`. Using the DataFrame `data` and attribute `columns`, an index of all the column names in the DataFrame can be accessed with `data.columns`.

Methods are called in a similar fashion using the syntax `df_object.method()`. As an example, `data.head()` gets the first few rows in the DataFrame `data` using the `head()` method. With a method, we can supply extra information in the parentheses as arguments to control its behaviour.

Let’s look at the data using these.

In [None]:
data.columns
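The difference between attributes and methods can be sketched on a small hand-built DataFrame. The names and values below are made up for illustration:

```python
import pandas

# Made-up table indexed by country.
toy = pandas.DataFrame(
    {
        'gdpPercap_1952': [100.0, 200.0, 300.0],
        'gdpPercap_1957': [110.0, 220.0, 330.0],
    },
    index=['Atlantis', 'Lemuria', 'Mu'],
)
toy.index.name = 'country'

# Attributes: no parentheses.
print(toy.columns)  # the column names
print(toy.shape)    # (rows, columns)

# Methods: parentheses, optionally with arguments.
print(toy.head(2))  # first two rows
print(toy.tail(1))  # last row
```

With no argument, `head()` returns up to the first five rows, so on a table this small it returns everything.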

Let's load the European data to have more rows:

In [None]:
eu_data = pandas.read_csv('data/gapminder_gdp_europe.csv', index_col='country')
print(eu_data)

In [None]:
eu_data.info()

Let's find out what the `head()` method does:

In [None]:
help(eu_data.head)

In [None]:
eu_data.head()

In [None]:
eu_data.head(3)

In [None]:
data.describe()

We often want to calculate summary statistics grouped by subsets or attributes within fields of our data. For example, we might want to calculate the average GDP per capita for 1962.

We can calculate basic statistics for all records in a single column using the syntax below:

In [None]:
data.gdpPercap_1962.describe()

We can also extract one specific metric if we wish:

In [None]:
data.gdpPercap_1962.mean()

In [None]:
data.gdpPercap_1962.std()

The pandas method `describe()` returns descriptive statistics for a particular column in the data, including the count, mean, standard deviation, minimum, maximum and quartiles (the 50% value is the median). By default, `describe()` only returns summary values for columns containing numeric data.

It is not particularly useful with just two records, but very helpful when there are thousands.

In [None]:
eu_data.gdpPercap_1962.describe()
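To see exactly which statistics `describe()` reports, and that non-numeric columns are left out by default, here is a minimal sketch on a made-up table (the names and values are illustrative, not Gapminder data):

```python
import pandas

# Made-up table with one text column and one numeric column.
toy = pandas.DataFrame({
    'country': ['Atlantis', 'Lemuria', 'Mu', 'Hyperborea'],
    'gdpPercap_1962': [100.0, 200.0, 300.0, 400.0],
})

summary = toy.describe()
print(summary)

# The row labels of the result are the statistics that were computed.
print(list(summary.index))

# Only the numeric column is summarised by default.
print(list(summary.columns))

# The '50%' row is the median of the column.
median_from_describe = summary.loc['50%', 'gdpPercap_1962']
print(median_from_describe == toy['gdpPercap_1962'].median())
```

The result of `describe()` is itself a DataFrame, so individual statistics can be picked out of it by row and column label.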

## Exercise 2.1.1

- Read the data in `gapminder_gdp_americas.csv` (which should be in the same directory as `gapminder_gdp_oceania.csv`) into a variable called `americas_data` and display its summary statistics.
- As well as the `read_csv()` function for reading data from a file, Pandas provides a `to_csv()` method to write dataframes to files. Applying what you’ve learned about reading from files, write one of your dataframes to a file called `processed.csv`. You can use `help()` to get information on how to use `to_csv()`.

## Manipulating data with Pandas (live coding session)

## Next session

Go to our next notebook: [Session 2.2: Data visualisation with Matplotlib](22_python_data.ipynb)