## Introduction to the pandas module

Built on NumPy, the Pandas module is essential to data science. Pandas has easy-to-use tools for acquiring, cleaning, manipulating, and displaying data. It introduces two new data structures: **Series** and **DataFrame**.

### Series data structure
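- A Pandas Series is a one-dimensional, labeled data structure with all elements of the **same type** (its dtype).
- The elements can be of any one data type: integers, floats, strings, or general Python objects. A Series is **similar to a Python list**, except that every element also carries an index label.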
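
As a rough illustration of the points above, the sketch below builds a Series from a list, once with the default labels and once with custom ones. The values and city names are made up for the example.

```python
import pandas as pd

# A Series built from a list; pandas assigns the default labels 0, 1, 2
temps = pd.Series([20.5, 22.1, 19.8])

# The same values with custom index labels (hypothetical city names)
labeled = pd.Series([20.5, 22.1, 19.8], index=['Oslo', 'Rome', 'Lima'])

print(temps.dtype)       # the single dtype shared by every element
print(labeled['Rome'])   # look up a value by its label
print(labeled.iloc[0])   # look up a value by its position
```

Unlike a plain list, the Series keeps its labels attached to the values, so a lookup by label keeps working even if the order of the elements changes.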

### The Pandas Dataframe
- A dataframe is a two-dimensional, labeled data structure consisting of **rows and columns**.
- By default, rows are **indexed** with integers starting at **zero**.
- Each **column** in a dataframe is a **series** object.
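
A minimal sketch of these three points, using a few values copied from the life expectancy dataset that appears later in this section. The variable names are only illustrative.

```python
import pandas as pd

# Three rows of two columns (values copied from the life expectancy data)
df_small = pd.DataFrame({
    'year': [1952, 1957, 1962],
    'lifeExp': [28.801, 30.332, 31.997],
})

print(df_small.shape)         # (number of rows, number of columns)
print(list(df_small.index))   # default row labels start at zero

# Selecting a single column returns a Series
life = df_small['lifeExp']
print(type(life))
```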

### Acquiring data



Pandas has functions for importing data from a variety of sources. Each one below assumes that pandas has been imported with import pandas as pd:

- pd.read_csv(filename): from a CSV file
- pd.read_table(filename): from a delimited text file (tab-separated by default, like TSV)
- pd.read_excel(filename): from an Excel file
- pd.read_sql(query, connection_object): from a SQL table or database
- pd.read_json(path_or_buf): from a JSON file, URL, or file-like object (recent pandas versions expect a literal JSON string to be wrapped in io.StringIO)
- pd.read_html(url): parses an HTML URL or file and extracts its tables into a list of dataframes
- pd.DataFrame(dict): from a dict, with keys as column names and values as lists of data
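
To try two of these without any files on disk, the sketch below reads CSV text from memory and builds the same table from a dict. The three rows mirror the first lines of the Apple stock output shown below; wrapping the text in io.StringIO is just one convenient way to hand read_csv a file-like object.

```python
import io
import pandas as pd

# CSV text held in memory instead of in a file
csv_text = """AAPL_x,AAPL_y
2014-01-02,77.445395
2014-01-03,77.045575
2014-01-06,74.896972
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a dict: keys become the column names
df_dict = pd.DataFrame({
    'AAPL_x': ['2014-01-02', '2014-01-03', '2014-01-06'],
    'AAPL_y': [77.445395, 77.045575, 74.896972],
})

print(df_csv)
print(df_dict)
```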

In [55]:
# You can generate a dataframe with random data for testing

import numpy as np
import pandas as pd

# Make a df with 12 rows and 6 columns of random floats in [0, 1)
df_fake = pd.DataFrame(np.random.rand(12, 6))
print(df_fake)

           0         1         2         3         4         5
0   0.482669  0.964623  0.441755  0.164572  0.655034  0.276603
1   0.562612  0.663198  0.283033  0.528836  0.625107  0.525557
2   0.951833  0.168371  0.677403  0.829570  0.357788  0.703179
3   0.509530  0.782079  0.672833  0.052213  0.106995  0.615498
4   0.001125  0.697431  0.608288  0.179210  0.779170  0.284910
5   0.234731  0.996922  0.042130  0.030645  0.995511  0.018620
6   0.797007  0.007676  0.409448  0.335556  0.183252  0.283193
7   0.781679  0.167867  0.556331  0.415106  0.584109  0.429637
8   0.486461  0.449072  0.372873  0.500162  0.846352  0.303331
9   0.054186  0.391377  0.404824  0.959971  0.006627  0.351770
10  0.421768  0.596311  0.034870  0.383158  0.832110  0.368695
11  0.149383  0.803241  0.300618  0.243968  0.027780  0.347651


In [58]:
# you can read a CSV file from the Internet

# Read 2014 Apple stock price into a Pandas dataframe 
df = pd.read_csv('https://raw.githubusercontent.com/plotly/datasets/master/2014_apple_stock.csv')

print(df)

         AAPL_x      AAPL_y
0    2014-01-02   77.445395
1    2014-01-03   77.045575
2    2014-01-06   74.896972
3    2014-01-07   75.856461
4    2014-01-08   75.091947
..          ...         ...
235  2014-12-08  113.653345
236  2014-12-09  109.755497
237  2014-12-10  113.960331
238  2014-12-11  111.817477
239  2014-12-12  110.027139

[240 rows x 2 columns]


### Inspecting or viewing a dataframe

Pandas dataframes have several tools for inspecting their contents:
- df.info() method: prints information about the index, column datatypes, and memory usage
- df.shape attribute: holds the number of rows and columns as a tuple
- df.head(n) method: returns the first n rows
- df.tail(n) method: returns the final n rows of the dataframe
- df.describe() method: returns summary statistics for numerical columns

In [61]:
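# df here is the life expectancy dataset (loaded in cell In [64] below), not the Apple stock data
# df.info() prints its report itself and returns None, which is why "None" appears in the output
print(df.info())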
print('\nDataframe shape (rows,columns)')
print(df.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
year         1704 non-null int64
pop          1704 non-null float64
continent    1704 non-null object
lifeExp      1704 non-null float64
gdpPercap    1704 non-null float64
dtypes: float64(3), int64(1), object(2)
memory usage: 80.0+ KB
None

Dataframe shape (rows,columns)
(1704, 6)


In [62]:
print(df.head(6))         # display first 6 rows
print(df.tail(4))         # display last 4 rows

       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
5  Afghanistan  1977  14880372.0      Asia   38.438  786.113360
       country  year         pop continent  lifeExp   gdpPercap
1700  Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
1701  Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
1702  Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
1703  Zimbabwe  2007  12311143.0    Africa   43.487  469.709298


In [63]:
print(df.describe())

             year           pop      lifeExp      gdpPercap
count  1704.00000  1.704000e+03  1704.000000    1704.000000
mean   1979.50000  2.960121e+07    59.474439    7215.327081
std      17.26533  1.061579e+08    12.917107    9857.454543
min    1952.00000  6.001100e+04    23.599000     241.165877
25%    1965.75000  2.793664e+06    48.198000    1202.060309
50%    1979.50000  7.023596e+06    60.712500    3531.846989
75%    1993.25000  1.958522e+07    70.845500    9325.462346
max    2007.00000  1.318683e+09    82.603000  113523.132900


In [64]:
# Load a life expectancy dataset (cells In [61] to In [63] above show this data)

url = 'http://bit.ly/2cLzoxH'
df = pd.read_csv(url)
print(df)
print(df.describe())
print('\nLife Expectancy Median ',df['lifeExp'].median())

          country  year         pop continent  lifeExp   gdpPercap
0     Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1     Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2     Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3     Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4     Afghanistan  1972  13079460.0      Asia   36.088  739.981106
...           ...   ...         ...       ...      ...         ...
1699     Zimbabwe  1987   9216418.0    Africa   62.351  706.157306
1700     Zimbabwe  1992  10704340.0    Africa   60.377  693.420786
1701     Zimbabwe  1997  11404948.0    Africa   46.809  792.449960
1702     Zimbabwe  2002  11926563.0    Africa   39.989  672.038623
1703     Zimbabwe  2007  12311143.0    Africa   43.487  469.709298

[1704 rows x 6 columns]
             year           pop      lifeExp      gdpPercap
count  1704.00000  1.704000e+03  1704.000000    1704.000000
mean   1979.50000  2.960121e+07    59.474439    721

In [65]:
is_2007 = df['year'] == 2007                 # a boolean Series (mask): True where year is 2007
df_2007 = df[is_2007]                        # df_2007 keeps only the rows where the mask is True
print(df_2007.shape)
df2 = df_2007[['year','country','lifeExp']]  # a new dataframe with these 3 columns
print(df2)

(142, 6)
      year             country  lifeExp
11    2007         Afghanistan   43.828
23    2007             Albania   76.423
35    2007             Algeria   72.301
47    2007              Angola   42.731
59    2007           Argentina   75.320
...    ...                 ...      ...
1655  2007             Vietnam   74.249
1667  2007  West Bank and Gaza   73.422
1679  2007          Yemen Rep.   62.698
1691  2007              Zambia   42.384
1703  2007            Zimbabwe   43.487

[142 rows x 3 columns]


In [66]:
print(df2.min())
print()
print(df2.max())

year              2007
country    Afghanistan
lifeExp         39.613
dtype: object

year           2007
country    Zimbabwe
lifeExp      82.603
dtype: object
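
Note that min() and max() work column by column: the country value is the alphabetical minimum or maximum, and it need not come from the same row as the lifeExp value. To find which country has the lowest or highest life expectancy, one possible approach is idxmin() and idxmax() combined with .loc, sketched here on five of the 2007 rows copied from the output above (so the result applies to this sample only, not to the full dataset).

```python
import pandas as pd

# Five 2007 rows copied from the output above (not the full dataset)
sample = pd.DataFrame({
    'year': [2007, 2007, 2007, 2007, 2007],
    'country': ['Afghanistan', 'Albania', 'Algeria', 'Angola', 'Zimbabwe'],
    'lifeExp': [43.828, 76.423, 72.301, 42.731, 43.487],
})

# Column-wise minimums: each value may come from a different row
column_minimums = sample.min()

# idxmin()/idxmax() return the index label of the smallest/largest value;
# .loc[label] then retrieves that entire row
lowest = sample.loc[sample['lifeExp'].idxmin()]
highest = sample.loc[sample['lifeExp'].idxmax()]

print(column_minimums)
print(lowest['country'], lowest['lifeExp'])
print(highest['country'], highest['lifeExp'])
```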
