# Pandas
Pandas is the go-to library for data processing in Python (it is built on top of NumPy).

Pandas provides tabular data structures and high-performance functions to analyse, clean, explore, and manipulate ‘big’ multidimensional and heterogeneous data.

Think of Pandas data structures like you would a spreadsheet (like in Excel) or an SQL table – but faster, easier and more powerful.

Pandas has two core data structures:

**Series** - A one-dimensional, labelled, homogeneous array: all of its values share one data type, which can be any type. It is similar to a list, but adds features such as label-based indexing and missing-value handling.

**DataFrame** - A two-dimensional, labelled, and heterogeneous tabular data structure with its own indexing system.
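
As a minimal sketch of the difference between the two structures, the cell below builds a Series and a DataFrame from a few of the consumption values used later in this notebook. The variable names and the `in_africa` column are illustrative assumptions, not part of the exercise data.

```python
import pandas as pd

countries = ["Namibia", "Portugal", "Egypt"]

# A Series: one column of values, each with a label in the index.
consumption = pd.Series([22, 258, 1105], index=countries,
                        name="Primary Energy Consumption (TWh)")
print(consumption["Portugal"])  # look up a value by its label
print(consumption.dtype)        # a Series has a single dtype

# A DataFrame: several columns sharing one index; each column keeps its own dtype.
table = pd.DataFrame({"consumption_twh": [22, 258, 1105],
                      "in_africa": [True, False, True]},  # illustrative column
                     index=countries)
print(table.dtypes)
```

Selecting a single column of a DataFrame, for example `table["consumption_twh"]`, gives back a Series.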

First, let's make sure Pandas has been imported:

In [None]:
import pandas as pd

## Let's work with Pandas DataFrames
First you will create a hypothetical DataFrame, then you will move on to importing energy data with Pandas and computing basic descriptive statistics.

There are many ways to create a DataFrame from scratch. The most common one takes a Python dictionary whose keys are the column names and whose values are lists of column entries.

In [None]:
df = pd.DataFrame({"Country": ["Namibia", "Portugal", "Egypt", "Haiti", "Thailand", "Bolivia", "Estonia"],
                   "Primary Energy Consumption (TWh)": [22, 258, 1105, 12, 1406, 85, 62]})
df
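
A dictionary of lists is only one option. As a hedged sketch of another, the cell below builds the same kind of table from a list of row dictionaries and then uses a column as the row labels. The names `records`, `df_records` and `df_by_country` are made up for this example.

```python
import pandas as pd

# Each dictionary is one row; its keys become the column names.
records = [
    {"Country": "Namibia", "Primary Energy Consumption (TWh)": 22},
    {"Country": "Portugal", "Primary Energy Consumption (TWh)": 258},
    {"Country": "Egypt", "Primary Energy Consumption (TWh)": 1105},
]
df_records = pd.DataFrame(records)

# Use a column as the row labels instead of the default 0, 1, 2, ...
df_by_country = df_records.set_index("Country")
print(df_by_country.loc["Egypt"])
```

The `index` keyword of `pd.DataFrame` is an alternative to `set_index` when the row labels are known at creation time.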

## Activity 1: Creating your own data and placing it into a Pandas dataframe   

1. Using data from [OurWorldInData](https://ourworldindata.org/grapher/primary-sub-energy-source?tab=table), create a DataFrame for your group’s country which contains the primary energy consumption by source for at least 3 years and at least 3 sources.

     Your index should be the year, and your columns should be the energy source

2. Use the ```.mean()``` and ```.describe()``` methods for your data. Try setting the ```axis``` keyword to 0 or 1. What does it do?
3. Try plotting the data using the ```.plot()``` method. Change the type of plot using the ```kind``` keyword argument.


In [None]:
# create your dataframe here



In [None]:
# try out .mean(), which is a method of the DataFrame object

In [None]:
# another useful built-in method for quick data exploration is the .plot() method

### There are a few ways of importing data into Python, and Pandas makes it very convenient.  
  
Have a look at the code below to see two examples:  
  
`import pandas as pd
df = pd.read_csv('filename.txt', sep=" ", header=None, names=["a","b","c"])`
  
Or  
  
`import pandas as pd
df = pd.read_csv('file_location/filename.txt', delimiter="\t")  # a URL works here as well`
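
To try the `read_csv` options above without a file on disk, one possible sketch is to read from an in-memory string. The text content and the name `df_demo` are made-up stand-ins; `io.StringIO` makes the string behave like an open file.

```python
import io
import pandas as pd

# Stand-in for a space-separated text file with no header row (made-up values).
raw_text = "1 2.5 x\n2 3.5 y\n3 4.5 z\n"

# sep sets the delimiter, header=None says the first line is data,
# and names supplies the column labels.
df_demo = pd.read_csv(io.StringIO(raw_text), sep=" ", header=None,
                      names=["a", "b", "c"])
print(df_demo)
print(df_demo.dtypes)  # each column's type is inferred from its values
```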



#### Go ahead and import the data

The code below uses a URL to import data on the power plants of the USA, using the simplest form of input arguments to the Pandas `read_csv` function. Note that Pandas is imported as `pd`, its conventional alias, just as Numpy is conventionally imported as `np`. DataFrames are conventionally called `df` in Python, but you can use any other logical variable name, such as `usa_gen` here.


In [None]:
# The following requires an internet connection
import pandas as pd
usa_gen = pd.read_csv('https://raw.githubusercontent.com/wri/global-power-plant-database/master/source_databases_csv/database_USA.csv')


#### Let's have a look at the data

The `head` method is an easy way to inspect your dataset. The default number of rows to show is 5. Let's explore the dataset a bit. Use `usa_gen.head(20)` to see more rows.
  
  

In [None]:
usa_gen.head(10)
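
Beyond `head`, a few other attributes and methods give a quick overview of a DataFrame. The sketch below uses a copy of the small country table from earlier (named `df_small` here, an assumption for this example) so that it runs without an internet connection; the same calls should work on `usa_gen`.

```python
import pandas as pd

df_small = pd.DataFrame({
    "Country": ["Namibia", "Portugal", "Egypt", "Haiti", "Thailand", "Bolivia", "Estonia"],
    "Primary Energy Consumption (TWh)": [22, 258, 1105, 12, 1406, 85, 62],
})

print(df_small.shape)    # (number of rows, number of columns)
print(df_small.columns)  # the column labels
print(df_small.dtypes)   # the data type of each column
print(df_small.tail(2))  # the last two rows, the counterpart of head
```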

### Indexing with Pandas
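There are many ways to index a DataFrame. The cells below select data by column name, by row label, and by condition. Note that dot notation only works for column names that are valid Python identifiers.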

In [None]:
# dot notation
df.Country  

In [None]:
# or using square brackets
df["Primary Energy Consumption (TWh)"].head(3)  

In [None]:
# or using the .loc indexer (label-based, so the end label 4 is included)
df.loc[2:4, "Country"]

In [None]:
# you can also use conditional statements to index
df[df["Primary Energy Consumption (TWh)"] > 100]

In [None]:
# or use a for loop to iterate over the rows
for i, row in df.iterrows():
    print(row["Country"])
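
Two details of the indexing cells above are easy to miss, so here is a small sketch of them on a copy of the same table (named `df_idx`, an assumption for this example). `.loc` slices by label and includes the end label, while `.iloc` slices by position and excludes the end, as Python lists do. Conditions can also be combined, and many row-by-row loops can be replaced by an operation on a whole column.

```python
import pandas as pd

df_idx = pd.DataFrame({
    "Country": ["Namibia", "Portugal", "Egypt", "Haiti", "Thailand", "Bolivia", "Estonia"],
    "Primary Energy Consumption (TWh)": [22, 258, 1105, 12, 1406, 85, 62],
})
col = "Primary Energy Consumption (TWh)"

by_label = df_idx.loc[2:4, "Country"]  # labels 2, 3 and 4
by_position = df_idx.iloc[2:4, 0]      # positions 2 and 3 of the first column
print(by_label.tolist())     # ['Egypt', 'Haiti', 'Thailand']
print(by_position.tolist())  # ['Egypt', 'Haiti']

# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
mid_range = df_idx[(df_idx[col] > 50) & (df_idx[col] < 300)]
print(mid_range["Country"].tolist())  # ['Portugal', 'Bolivia', 'Estonia']

# A column operation applies to every row at once, with no explicit loop.
share = df_idx[col] / df_idx[col].sum()
print(share.round(3))
```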

## Activity 2

Let's pull some data from the `usa_gen` DataFrame that we have loaded, based on some of the information given to you through the graphic above. *Hint: It is best to only use the data that you need.*

**Use the cells below to print the following values:**

1. The total number of power plants in the US.
2. The minimum, maximum and average capacity of the plants in MW.
3. The number of listed solar farms.
4. The number of different plant types listed.
5. The year with the highest generation from all of the plants.

In [None]:
# The total number of power plants in the US with capacity greater than 5MW.

In [None]:
# The minimum, maximum and average capacity of the plants in MW.

In [None]:
# The number of listed solar farms.

In [None]:
# The number of different plant types listed.

In [None]:
# The year with the highest generation from all of the plants.

### Statistics, plotting etc
Pandas also provides many methods for statistical analysis and plotting of large datasets, among other things.

In [None]:
# to calculate some summary statistics on the USA power plant capacity data
usa_gen["capacity_mw"].describe()

In [None]:
# to plot capacity vs commissioning year
usa_gen.plot(x="commissioning_year", y="capacity_mw", kind='scatter')

## Activity 3
Try changing the variables and type of plot produced to investigate other features in the data.