### Step 1: Import Pandas and NumPy

We will begin by importing the packages that we'll need to use with Python. 

We load pandas with `import pandas as pd`. The `as pd` part is an alias: it lets us call functions from `pandas` with `pd.<function>` instead of `pandas.<function>`. The alias is only a convenience and is **not** necessary to load the package.

We also import the `numpy` package, which pandas relies on for much of its numerical work.

In [2]:
import numpy as np
import pandas as pd
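
To make the role of the alias concrete, here is a minimal sketch showing that `pd` and `pandas` refer to the same module. It assumes only that pandas is installed.

```python
import pandas
import pandas as pd

# Both names are bound to the same module object; `pd` is only a shorter alias.
print(pd is pandas)

# Anything reachable through one name is reachable through the other.
print(pd.DataFrame is pandas.DataFrame)
```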

### Step 2: Import Your Dataset

We read tabular data from a comma-separated values (CSV) file into a DataFrame object that we'll call `df`.

To create `df`, we call the `pd.read_csv()` function on our data file, placing the relative file path inside the parentheses.

In [3]:
df = pd.read_csv("global_power_plant_database.csv")
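
If the data file is not at hand, the same call can be tried on a small in-memory CSV. The sketch below is only an illustration: `csv_text` and `sample_df` are hypothetical names, and the three rows are a hand-typed stand-in with a reduced set of columns, not the real file.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for the real CSV file.
csv_text = """country,country_long,name,capacity_mw,primary_fuel
AFG,Afghanistan,Kajaki Hydroelectric Power Plant Afghanistan,33.0,Hydro
ALB,Albania,Bistrica 1,27.0,Hydro
DZA,Algeria,Adrar,20.0,Solar
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.head())  # first rows of the table
```

Checking `shape` and `head()` right after loading is a quick way to confirm the file was parsed the way you expected.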

### Step 3: Analyzing Data

We will begin by looking at the columns and rows in our data rather than the actual values.

We will generally access individual variables with `df.<column>` or `df["<column>"]` notation.

Start by running `df.columns`, and then move on to more specific data.
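
The two notations can be compared on a small example. This is a sketch on a hypothetical two-row frame called `demo`, not the real dataset; it suggests that dot access is shorthand that only works when the column name is a valid Python identifier.

```python
import pandas as pd

# Hypothetical miniature of the dataset, including one awkwardly named column.
demo = pd.DataFrame({
    "country_long": ["Afghanistan", "Albania"],
    "capacity_mw": [33.0, 27.0],
    "Unnamed: 8": [None, None],
})

# Dot and bracket notation return the same column for identifier-like names.
same = demo.country_long.equals(demo["country_long"])
print(same)

# A name containing a space or colon can only be reached with brackets.
print(demo["Unnamed: 8"])

# Passing a list of names returns a DataFrame with just those columns.
subset = demo[["country_long", "capacity_mw"]]
print(subset)
```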

In [6]:
df.columns

Index(['country', 'country_long', 'name', 'gppd_idnr', 'capacity_mw',
       'latitude', 'longitude', 'primary_fuel', 'Unnamed: 8', 'Unnamed: 9',
       'Unnamed: 10', 'Unnamed: 11', 'Unnamed: 12', 'source', 'url',
       'geolocation_source', 'wepp_id', 'year_of_capacity_data', 'Unnamed: 18',
       'Unnamed: 19', 'Unnamed: 20', 'Unnamed: 21', 'Unnamed: 22',
       'Unnamed: 23', 'Unnamed: 24', 'Unnamed: 25',
       'estimated_generation_gwh_2013', 'estimated_generation_gwh_2014',
       'estimated_generation_gwh_2015', 'estimated_generation_gwh_2016',
       'estimated_generation_gwh_2017', 'estimated_generation_note_2013',
       'estimated_generation_note_2014', 'estimated_generation_note_2015',
       'estimated_generation_note_2016', 'estimated_generation_note_2017'],
      dtype='object')

This output lists every variable (column) we are working with. Look through it to decide which columns are usable for this project and which are not.

Once you have decided what is usable, begin to look at those specific variables.
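
pandas assigns names of the form `Unnamed: N` to columns whose header cell is blank, which is probably where the `Unnamed` columns in the listing come from. The sketch below reproduces that on a hypothetical CSV string and shows one way to drop such columns. Whether they are safe to drop in the real file is an assumption you should verify first, for example by checking that they contain no values.

```python
import io

import pandas as pd

# Hypothetical CSV whose header row has an empty third field.
csv_text = "country,capacity_mw,,primary_fuel\nAFG,33.0,,Hydro\nALB,27.0,,Hydro\n"
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Confirm the placeholder column is completely empty before discarding it.
print(raw.loc[:, raw.columns.str.startswith("Unnamed")].isna().all())

# Keep only the columns whose name does not start with "Unnamed".
keep = ~raw.columns.str.startswith("Unnamed")
cleaned = raw.loc[:, keep]
print(list(cleaned.columns))
```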

In [7]:
df.country_long

0        Afghanistan
1        Afghanistan
2        Afghanistan
3        Afghanistan
4        Afghanistan
            ...     
34931         Zambia
34932         Zambia
34933         Zambia
34934       Zimbabwe
34935       Zimbabwe
Name: country_long, Length: 34936, dtype: object

This returns the `country_long` value for every row, so each country name is repeated once for every power plant it has in the data.

We can use the `drop_duplicates` method to keep only the first row for each distinct country.

In [9]:
df.drop_duplicates("country_long")

| | country | country_long | name | gppd_idnr | capacity_mw | latitude | longitude | primary_fuel | Unnamed: 8 | Unnamed: 9 | ... | estimated_generation_gwh_2013 | estimated_generation_gwh_2014 | estimated_generation_gwh_2015 | estimated_generation_gwh_2016 | estimated_generation_gwh_2017 | estimated_generation_note_2013 | estimated_generation_note_2014 | estimated_generation_note_2015 | estimated_generation_note_2016 | estimated_generation_note_2017 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFG | Afghanistan | Kajaki Hydroelectric Power Plant Afghanistan | GEODB0040538 | 33.0 | 32.3220 | 65.1190 | Hydro | | | ... | 123.77 | 162.90 | 97.39 | 137.76 | 119.50 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 |
| 9 | ALB | Albania | Bistrica 1 | WRI1002169 | 27.0 | 39.9116 | 20.1047 | Hydro | | | ... | 105.17 | 75.26 | 79.50 | 105.45 | 88.45 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 |
| 17 | DZA | Algeria | Adrar | WKS0068905 | 20.0 | 27.9080 | -0.3170 | Solar | | | ... | | 35.22 | 34.22 | 35.33 | 35.17 | NO-ESTIMATION | SOLAR-V1-NO-AGE | SOLAR-V1-NO-AGE | SOLAR-V1-NO-AGE | SOLAR-V1-NO-AGE |
| 76 | AGO | Angola | Biopio | WRI1023002 | 22.8 | -12.4706 | 13.7319 | Oil | | | ... | | | | | 64.92 | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | CAPACITY-FACTOR-V1 |
| 90 | ATA | Antarctica | McMurdo Station Generator | WRI1023843 | 6.6 | -77.8470 | 166.6605 | Oil | | | ... | | | | | | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 34675 | VNM | Vietnam | A Luoi | WRI1030864 | 170.0 | 16.2266 | 107.2728 | Hydro | | | ... | 660.65 | 622.22 | 518.75 | 639.21 | 374.44 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 | HYDRO-V1 |
| 34911 | ESH | Western Sahara | Dakhla IC Power Plant Western Sahara | GEODB0042583 | 23.4 | 23.6816 | -15.9594 | Oil | | | ... | | | | | | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION |
| 34912 | YEM | Yemen | Al Hiswa | WRI1022444 | 125.0 | 12.8271 | 44.9227 | Oil | | | ... | | | | | 268.32 | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | CAPACITY-FACTOR-V1 |
| 34919 | ZMB | Zambia | Bancroft Copperbelt | WRI1022387 | 20.0 | -12.3786 | 27.8317 | Oil | | | ... | | | | | 73.51 | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | NO-ESTIMATION | CAPACITY-FACTOR-V1 |
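
`drop_duplicates` returns whole rows, namely the first plant listed for each country. If only the list of country names is needed, `unique()` and `nunique()` on the column itself are a lighter alternative. The sketch below uses a hypothetical five-row frame called `plants` with made-up plant names.

```python
import pandas as pd

# Hypothetical data: the plant names are placeholders, not real entries.
plants = pd.DataFrame({
    "country_long": ["Afghanistan", "Afghanistan", "Albania", "Algeria", "Algeria"],
    "name": ["Plant A", "Plant B", "Plant C", "Plant D", "Plant E"],
})

# Keeps the first row seen for each country, along with all of its columns.
first_per_country = plants.drop_duplicates("country_long")
print(first_per_country)

# If only the names are needed, work on the column itself.
countries = plants["country_long"].unique()     # array of distinct values
n_countries = plants["country_long"].nunique()  # number of distinct values
print(countries, n_countries)
```

Note that the original row labels survive in `first_per_country`, which is why the index in the output above jumps from 0 to 9 to 17.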


### Step 4: Filtering Our Data

Now we will use the `iloc` and `loc` indexers to select parts of our data. `iloc` selects by integer position and `loc` selects by label, which helps us pull out only the data we are actually going to use.


In [11]:
df.iloc[:, 4]  # All rows of the column at position 4 (counting from 0): capacity_mw

0         33.0
1         10.0
2         10.0
3         66.0
4        100.0
         ...  
34931     50.0
34932     20.0
34933    108.0
34934    920.0
34935    750.0
Name: capacity_mw, Length: 34936, dtype: float64
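
Position-based and label-based selection can be compared side by side. The sketch below uses a hypothetical three-row frame called `demo`. One difference to watch for is that `iloc` slices exclude the end position, while `loc` slices include the end label.

```python
import pandas as pd

# Hypothetical three-row frame with a default 0..2 index.
demo = pd.DataFrame({
    "country_long": ["Afghanistan", "Albania", "Algeria"],
    "primary_fuel": ["Hydro", "Hydro", "Solar"],
    "capacity_mw": [33.0, 27.0, 20.0],
})

by_position = demo.iloc[:, 2]          # third column, counting from zero
by_label = demo.loc[:, "capacity_mw"]  # the same column, selected by name
print(by_position.equals(by_label))

# iloc: rows 0 and 1, columns 0 and 1 (end positions excluded).
pos_slice = demo.iloc[0:2, 0:2]
print(pos_slice)

# loc: rows labelled 0 through 2, columns country_long through primary_fuel (ends included).
label_slice = demo.loc[0:2, "country_long":"primary_fuel"]
print(label_slice)
```

Selecting by label tends to be more robust than selecting by position, because it keeps working if the column order of the file changes.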

In [14]:
df.loc[:,["country_long", "capacity_mw", "primary_fuel"]].sample(n=30)

| | country_long | capacity_mw | primary_fuel |
|---|---|---|---|
| 5408 | China | 400.0 | Hydro |
| 11548 | France | 3.666 | Biomass |
| 1280 | Brazil | 27.0 | Wind |
| 10426 | France | 8.0 | Hydro |
| 11788 | France | 10.4 | Wind |
| 4470 | Canada | 80.0 | Wind |
| 1103 | Bhutan | 1020.0 | Hydro |
| 17400 | Mexico | 4.0 | Biomass |
| 33810 | United States of America | 12.0 | Hydro |
| 28803 | United States of America | 2.6 | Gas |
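
`loc` also accepts a boolean mask in the row position, which lets us filter rows by a condition rather than by label. The sketch below is an illustration on a hypothetical six-row frame called `plants`, built from a few of the sampled rows above. The 100 MW threshold is an arbitrary choice for the example. It also shows that passing `random_state` to `sample` makes the random draw repeatable.

```python
import pandas as pd

# Hypothetical frame built from a few of the sampled rows shown above.
plants = pd.DataFrame({
    "country_long": ["China", "France", "Brazil", "France", "Canada", "Bhutan"],
    "capacity_mw": [400.0, 3.666, 27.0, 8.0, 80.0, 1020.0],
    "primary_fuel": ["Hydro", "Biomass", "Wind", "Hydro", "Wind", "Hydro"],
})

# Boolean mask: True for rows that satisfy both conditions.
mask = (plants["primary_fuel"] == "Hydro") & (plants["capacity_mw"] >= 100)
large_hydro = plants.loc[mask, ["country_long", "capacity_mw"]]
print(large_hydro)

# A fixed random_state makes the random sample repeatable between runs.
first_draw = plants.sample(n=3, random_state=0)
second_draw = plants.sample(n=3, random_state=0)
print(first_draw.equals(second_draw))
```

Each condition needs its own parentheses when combined with `&` or `|`, because those operators bind more tightly than comparisons in Python.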
