# Working with data
## Python Data Analysis Library (a.k.a. Pandas)

***
<br>

## Pandas

* The name is derived from '**pan**el **da**ta', an econometrics term for data sets that contain observations from multiple periods for the same individuals.

<img src="img/pandas-python.jpg" style="width:400px">

## What is Pandas?

<img src="img/pandas-logo.png" style="width:300px">

* Pandas is one of the most powerful Python packages for data analysis.
* It is often described as a Swiss Army knife for data analysis in Python.
* The library offers a comprehensive range of functionality, from loading data from different file types into memory, through data processing, to data visualisation.
* With Pandas we can load, clean, modify and analyse data, covering the kinds of tasks typically done with SQL or Excel, and much more.

## Importing the pandas library

* Importing the pandas library is done in exactly the same way as for any other library.
* In almost all examples of Python code using the pandas library, it will have been imported and given an alias of `pd`.

In [1]:
import pandas as pd

pd.__version__

'1.2.4'

## Pandas data structures

* There are two main data structures used by pandas: the **Series** and the **DataFrame**.
* A Series is roughly equivalent to a vector or a list. As an analogy, we can compare a Series object to an Excel column.
* A DataFrame is equivalent to a table. Each column in a pandas DataFrame is a pandas Series.
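As a small preview, the relationship between the two structures can be sketched as follows. The variable names (`years`, `table`, `year_column`) are illustrative, and the values are a few rows copied from the Lego sample output shown in this lesson.

```python
import pandas as pd

# A Series is a one-dimensional labelled array, comparable to one spreadsheet column.
years = pd.Series([1965, 1979, 1987], name="year")

# A DataFrame is a table; here it is built from a dict mapping column names to values.
table = pd.DataFrame({
    "name": ["Gears", "Town Mini-Figures", "Castle 2 for 1 Bonus Offer"],
    "year": [1965, 1979, 1987],
})

# Selecting a single column of a DataFrame gives back a Series.
year_column = table["year"]

print(type(years).__name__)        # Series
print(type(table).__name__)        # DataFrame
print(type(year_column).__name__)  # Series
```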

More details and operations on pandas data types will be presented in a later lesson of the course.

## Reading a csv file

* In pandas, CSV files are read as complete datasets. You do not have to explicitly open and close the file.
* All of the dataset records are assembled into a DataFrame.
* If your dataset has column headers in the first record, these are used as the DataFrame column names. By default `read_csv` treats the first row as the header, so you rarely need to say so; you can state it explicitly with the `header` parameter (for example `header=0`).
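The header behaviour described above can be tried without any file on disk. This is a minimal sketch that assumes a tiny in-memory CSV (`csv_text`, illustrative data) wrapped in `io.StringIO` so that `read_csv` can treat it like a file.

```python
import io
import pandas as pd

# A tiny in-memory CSV standing in for a file on disk (illustrative data).
csv_text = "set_num,name,year\n001-1,Gears,1965\n0011-2,Town Mini-Figures,1979\n"

# Default behaviour: the first row supplies the column names.
with_header = pd.read_csv(io.StringIO(csv_text))
print(list(with_header.columns))  # ['set_num', 'name', 'year']

# The same thing stated explicitly: row 0 is the header row.
explicit = pd.read_csv(io.StringIO(csv_text), header=0)

# header=None says there is no header row: columns are numbered
# and the first line is treated as ordinary data.
no_header = pd.read_csv(io.StringIO(csv_text), header=None)
print(list(no_header.columns))  # [0, 1, 2]
print(len(with_header), len(no_header))  # 2 3
```

Passing `header=None` to a file that does have a header is a common mistake: the column names end up as the first data row.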


* Lego data downloaded from rebrickable.com (sets.csv.gz file) will be used as an example.
* The data from the `sets.csv` file will be read into a DataFrame object assigned to the `df` variable.
* The DataFrame object referenced by `df` can be thought of as an Excel sheet made up of rows, columns and individual cells.
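The spreadsheet analogy can be sketched on a small hand-built frame before the real file is loaded. The names `sample`, `column`, `row` and `cell` are illustrative; the three rows are copied from the sample output of `sets.csv` in this lesson.

```python
import pandas as pd

# Three rows copied from the sample output of sets.csv shown in this lesson.
sample = pd.DataFrame({
    "set_num": ["001-1", "0011-2", "0011-3"],
    "name": ["Gears", "Town Mini-Figures", "Castle 2 for 1 Bonus Offer"],
    "year": [1965, 1979, 1987],
})

column = sample["name"]      # one column, like a spreadsheet column
row = sample.loc[1]          # one row, selected by its index label
cell = sample.at[0, "name"]  # one cell, selected by row label and column name

print(cell)         # Gears
print(row["year"])  # 1979
print(len(column))  # 3
```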

In [2]:
# Forward slashes work on every platform and avoid the invalid "\s" escape sequence
df = pd.read_csv("data/sets.csv")

df

,set_num,name,year,theme_id,num_parts
0,001-1,Gears,1965,1,43
1,0011-2,Town Mini-Figures,1979,67,12
2,0011-3,Castle 2 for 1 Bonus Offer,1987,199,0
3,0012-1,Space Mini-Figures,1979,143,12
4,0013-1,Space Mini-Figures,1979,143,12
...,...,...,...,...,...
19422,XWING-1,Mini X-Wing Fighter,2019,158,60
19423,XWING-2,X-Wing Trench Run,2019,158,52
19424,YODACHRON-1,Yoda Chronicles Promotional Set,2013,158,413
19425,YTERRIER-1,Yorkshire Terrier,2018,598,0


#### Information about the presentation of the contents of the variable `df`

* You can see the contents by simply entering the variable name.
* You can see from the output that it is a tabular format.
* The column names have been taken from the first record of the file.
* On the left-hand side is a column with no name.
* The entries here have been provided by pandas and act as an index to reference the individual rows of the DataFrame.
* Another thing to notice about the display is that it is truncated. By default you will see the first and last 5 rows. For the columns, you will always get the first few and typically the last few, depending on the display space.
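When the truncated view is not enough, the number of displayed rows can be controlled. The following is a sketch on a hypothetical 100-row frame (`big`, not part of the Lego data), using `head()`, `tail()` and the `display.max_rows` / `display.min_rows` options.

```python
import pandas as pd

# A hypothetical 100-row frame, large enough to be truncated when displayed.
big = pd.DataFrame({"n": range(100), "square": [i * i for i in range(100)]})

# head() and tail() return the first and last rows (5 by default).
print(big.head())
print(big.tail(3))

# Display options control truncation; option_context applies them temporarily.
with pd.option_context("display.max_rows", 10, "display.min_rows", 10):
    truncated_text = repr(big)  # middle rows are replaced by a row of dots

with pd.option_context("display.max_rows", None):
    full_text = repr(big)       # max_rows=None shows every row

print(len(truncated_text.splitlines()) < len(full_text.splitlines()))  # True
```

Using `option_context` keeps the change local to the `with` block, so the notebook's default display settings are left untouched.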

## ---- Exercise ----

Load the data stored in the `data/countries.csv` file and display the DataFrame containing the loaded data.

In [None]:
# Write your code here
