# Introduction to Pandas

Pandas is a powerful and useful library for analyzing data quickly and efficiently. It is built on top of NumPy, so it lets you manipulate and explore your data using high-level code. In this notebook we'll introduce some basic features of Pandas.

# Pandas Series

First, we need to understand the core data structures of Pandas: Series and DataFrames. Let's begin with pandas Series: what they are and how they can be created.

As usual, we'll import the pandas library and give it the alias `pd` by convention.

In [1]:
# Importing pandas library
import pandas as pd

Okay, so what is a Series? Let's take it slowly here. I assume that you are familiar with Excel sheets. Take a look at the following image, which displays example data for eight of Sweden's urban areas:

The blue arrows point to the values, each of which sits at a specific position given by the numbered rows. Similarly, you can think of a Series as a single column of a sheet. Let's see a pandas Series in action. To create one, we call `pd.Series()` and pass it a list of values, in this case strings separated by commas:

In [2]:
# Creating a Series with string values
first_series = pd.Series(['Stockholm','Uppsala','Luleå','Gävle','Falun','Lund','Göteborg','Karlstad'])
# Printing first_series variable and the type
print(first_series, '\n'*2, type(first_series))

0    Stockholm
1      Uppsala
2        Luleå
3        Gävle
4        Falun
5         Lund
6     Göteborg
7     Karlstad
dtype: object 

 <class 'pandas.core.series.Series'>


The result above is a `Series` object, which is simply a one-dimensional array of elements. To the left of the elements are the indices (axis labels), which are created automatically and start at **0**. Below the values, `dtype: object` tells us that the values of the Series have the object data type, which is how the strings are stored here.

A Series has several attributes, such as `index`, which returns the indices of the Series:

In [3]:
# Getting indices of a Series
first_series.index

RangeIndex(start=0, stop=8, step=1)

The output is of type `RangeIndex`. This type of index is generated whenever the indices are created automatically by pandas, that is, when we do not define them explicitly. The `values` attribute, in turn, returns the stored data as an array:

In [4]:
first_series.values

array(['Stockholm', 'Uppsala', 'Luleå', 'Gävle', 'Falun', 'Lund',
       'Göteborg', 'Karlstad'], dtype=object)

______

- values, indices, and name of a Series: adding indices at initialization and afterwards
- creating a Series from data stored in a dict or list
- detecting NA values using functions
______
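
The points above can be sketched in a few lines. This is only a minimal illustration: the index labels and the numbers in `counts` are made up for the example and are not real data.

```python
import pandas as pd

# Passing an explicit index at initialization (labels are illustrative)
cities = pd.Series(['Stockholm', 'Uppsala', 'Lund'], index=['a', 'b', 'c'])

# Replacing the index after the Series has been created
cities.index = ['x', 'y', 'z']

# Giving the Series a name
cities.name = 'city'

# Creating a Series from a dict: the keys become the index
# (the numbers are made up; None stands in for a missing value)
counts = pd.Series({'Stockholm': 3, 'Uppsala': None, 'Lund': 1})

# Detecting missing values
missing = counts.isna()      # True where a value is missing
present = counts.notna()     # the opposite mask

print(cities)
print(missing)
```

Both `isna()` and `notna()` return a boolean Series with the same index, so they can be used directly for filtering.
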

# Pandas DataFrame

**image - comparison of Excel sheet and DataFrame**
- description
- creating a df from various data inputs (dict, list)
- adding a name to a df
- displaying the values and indices of a df
- reindexing a df with `reindex()`
- setting and resetting the index with `set_index()` and `reset_index()`
- adding a new column to a df
- dropping columns and rows (`inplace`; dropping by label name or index together with the corresponding `axis`)
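
Most of the operations listed above can be tried out on a tiny DataFrame. The sketch below assumes a hypothetical table of cities with made-up counts; the column names `city`, `count` and `doubled` are chosen only for illustration.

```python
import pandas as pd

# Creating a DataFrame from a dict of columns (the counts are made up)
df = pd.DataFrame({'city': ['Stockholm', 'Uppsala', 'Lund'],
                   'count': [120, 80, 150]})

# Creating a DataFrame from a list of rows
from_rows = pd.DataFrame([['Stockholm', 120], ['Uppsala', 80]],
                         columns=['city', 'count'])

# Using a column as the index, then going back to the default RangeIndex
by_city = df.set_index('city')
restored = by_city.reset_index()

# reindex() conforms the rows to a new list of labels;
# labels that did not exist before ('Falun') are filled with NaN
reindexed = by_city.reindex(['Lund', 'Stockholm', 'Falun'])

# Adding a new column
df['doubled'] = df['count'] * 2

# Dropping a column (axis=1) and a row (axis=0) by label.
# drop() returns a new object; inplace=True would modify df itself instead.
no_doubled = df.drop('doubled', axis=1)
no_first_row = df.drop(0, axis=0)

print(reindexed)
```
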

# Selection techniques

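- indexing and filtering - several ways
- with label-based slicing (`loc`) the endpoint is inclusive; with position-based slicing (`iloc`, `data[0:3]`) it is exclusive
- slicing columns and rows; filtering rows based on a condition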

```
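data['count']               # select a single column by label
data[0]                     # select the column labelled 0 (on a Series: the element labelled 0)
data[0:3]                   # slice rows by position; the endpoint is excluded
data[data['count'] > 100]   # keep the rows where the condition is True
# data.ix[] was removed in pandas 1.0; use loc or iloc instead
data.loc[]                  # label-based indexing, or a boolean array
data.iloc[]                 # position-based indexing
data.query()                # filter rows with a boolean expression given as a string
data.where()                # keep values where the condition holds, NaN elsewhere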
```
- example: a column named 'capital city' must be accessed with bracket indexing, `df['capital city']`; attribute access (`df.capital city`) won't work because of the space
- retrieved rows and columns may be views of df rather than copies; to get an independent copy, use `copy()` (assigning the selection to a new variable does not create a copy)
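
To make the differences between these selection techniques concrete, here is a small sketch. The DataFrame `data`, its two-letter index labels and the `count` values are assumptions made up for the example.

```python
import pandas as pd

# Hypothetical data: the counts and index labels are made up
data = pd.DataFrame(
    {'capital city': ['Stockholm', 'Oslo', 'Helsinki', 'Copenhagen'],
     'count': [120, 80, 150, 95]},
    index=['se', 'no', 'fi', 'dk'],
)

col = data['capital city']            # bracket indexing handles the space in the name
first_three = data[0:3]               # positional row slice, endpoint excluded
by_label = data.loc['se':'fi']        # label slice, endpoint included
by_position = data.iloc[0:2]          # positional slice, endpoint excluded
large = data[data['count'] > 100]     # boolean mask
queried = data.query('count > 100')   # the same filter written as a string expression

# where() keeps the original shape and puts NaN where the condition is False
masked = data['count'].where(data['count'] > 100)

# An explicit copy can be changed without touching the original
subset = data.loc[['se', 'no']].copy()
subset.loc['se', 'count'] = 0

print(by_label)
print(masked)
```

Note how `loc['se':'fi']` returns three rows while `iloc[0:2]` returns two: only the label-based slice includes its endpoint.
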

# DataFrame exploration

initial exploration using:
```
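data.head(), data.tail()   # first / last rows, 5 by default
data.sample()              # random row(s), 1 by default
data.shape                 # (number of rows, number of columns)
data.columns               # column labels
data.dtypes                # data type of each column
data.info()                # concise summary: dtypes, non-null counts, memory usage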
```
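
As a quick illustration of these calls, the sketch below reuses the eight city names from the Series example and pairs them with made-up `count` values; the `random_state` argument is only there to make the sampled row reproducible.

```python
import pandas as pd

# City names from the Series example; the counts are made up
data = pd.DataFrame({
    'city': ['Stockholm', 'Uppsala', 'Luleå', 'Gävle',
             'Falun', 'Lund', 'Göteborg', 'Karlstad'],
    'count': [120, 80, 15, 40, 25, 60, 110, 35],
})

first = data.head()                        # first 5 rows by default
last = data.tail(3)                        # last 3 rows
one = data.sample(n=1, random_state=0)     # one random row, reproducible
shape = data.shape                         # (rows, columns)

print(shape)
print(data.columns)
print(data.dtypes)
data.info()                                # prints its summary directly
```
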

detecting missing values: 
- NaN, None values explanation
- getting row indices with NA values

```
data.isna()         # same as data.isnull()
data.notna()        # same as data.notnull()
data.isna().sum()   # number of missing values per column
```
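
A minimal sketch of missing-value detection follows, using a made-up DataFrame in which one text value is `None` and one number is `NaN`. Pandas treats both as missing. The last line shows one possible way to get the row indices that contain at least one NA value.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value per column
data = pd.DataFrame({'city': ['Stockholm', None, 'Lund'],
                     'count': [120, 80, np.nan]})

mask = data.isna()                  # boolean DataFrame, True where a value is missing
per_column = data.isna().sum()      # number of missing values in each column

# Row indices with at least one missing value
rows_with_na = data[data.isna().any(axis=1)].index.tolist()

print(per_column)
print(rows_with_na)
```
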
detecting duplicated rows:
```
data.duplicated()   # pass subset=[...] to look for duplicates in specific columns only
```
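
The effect of `subset` is easiest to see on a small made-up example. In the sketch below, the second row repeats the first one completely, while the third row repeats only the city name.

```python
import pandas as pd

# Hypothetical data: row 1 is a full duplicate of row 0,
# row 2 repeats only the city
data = pd.DataFrame({'city': ['Lund', 'Lund', 'Lund', 'Falun'],
                     'count': [1, 1, 2, 3]})

full_dupes = data.duplicated()                   # compares whole rows
city_dupes = data.duplicated(subset=['city'])    # compares the 'city' column only
deduped = data.drop_duplicates()                 # keeps the first occurrence of each row

print(full_dupes.tolist())
print(city_dupes.tolist())
```

By default the first occurrence is not flagged, only the later repeats are.
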

# Descriptive statistics
- by default, all of these methods exclude NA values


```
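data.describe()                # summary of numeric columns; pass include='object' for string columns
data.min(), data.max()         # smallest / largest value
data.idxmin(), data.idxmax()   # index label of the smallest / largest value
data.mean()                    # usually applied to a selected numeric column
data.median()
data.mode()                    # most frequent value(s)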
```

- `data.nunique()` - number of unique observations
- `data.value_counts()` - counts of unique observations (options: `ascending`, `normalize`)
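
Here is a short sketch of these methods on made-up data. The `count` column deliberately contains one `NaN` so that we can see the default behaviour of skipping missing values; `skipna=False` is shown only for contrast.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing number
data = pd.DataFrame({'city': ['Lund', 'Falun', 'Lund', 'Gävle'],
                     'count': [10.0, 20.0, np.nan, 30.0]})

numeric_summary = data.describe()          # count, mean, std, min, quartiles, max
text_summary = data['city'].describe()     # count, unique, top, freq for a text column

mean = data['count'].mean()                # NaN is skipped by default
mean_with_na = data['count'].mean(skipna=False)   # NaN propagates instead
median = data['count'].median()
lowest = data['count'].idxmin()            # index label of the smallest value
highest = data['count'].idxmax()           # index label of the largest value
most_common = data['city'].mode()          # most frequent value(s)

unique_cities = data['city'].nunique()
counts = data['city'].value_counts()                  # absolute counts
shares = data['city'].value_counts(normalize=True)    # relative frequencies

print(numeric_summary)
print(counts)
```
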

# Loading the data   
```
pd.read_csv()     # parameters; export with DataFrame.to_csv()
pd.read_excel()   # parameters; export with DataFrame.to_excel()
```
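
To try loading and exporting without needing a file on disk, the sketch below reads CSV text from an in-memory buffer. The CSV content is made up; with a real file you would pass a path instead of the `io.StringIO` object. `pd.read_excel()` and `DataFrame.to_excel()` follow the same pattern but need an Excel engine such as `openpyxl` to be installed, so they are not run here.

```python
import io

import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = "city,count\nStockholm,120\nUppsala,80\n"

# sep and index_col are two commonly used read_csv parameters
data = pd.read_csv(io.StringIO(csv_text), sep=',', index_col='city')

# Exporting: write the DataFrame back out as CSV text
buffer = io.StringIO()
data.to_csv(buffer)

# Reading the exported text again gives back the same table
reloaded = pd.read_csv(io.StringIO(buffer.getvalue()), index_col='city')

print(data)
```
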
