# Introduction to Pandas

In this section of the course we will learn how to use pandas for data analysis. You can think of pandas as an extremely powerful version of Excel, with a lot more features. Go through the notebooks in this order:

* DataTypes
    * Series
    * DataFrames
* Missing Data
* Operations
* Data Analysis
    * GroupBy
    * Merging, Joining, and Concatenating
* Data Input and Output

## 3. Operations

Pandas offers many operations that will be really useful to you but don't fall into any distinct category. Let's show them here in this lecture:

In [1]:
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz


**df.head()** shows all columns but only the first 5 rows (by default). This gives you a quick look at the columns and the kind of values they hold without displaying the entire dataset.

In [2]:
df.head()

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
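Because the example DataFrame has only four rows, `head()` returns all of them. A minimal sketch of asking for fewer rows by passing a row count (the variable name `first_two` is only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# head() returns the first 5 rows by default; pass a number to change that
first_two = df.head(2)
print(first_two)
```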


### Info on Unique Values

**df['col'].unique()** outputs an array of the unique set of values in a column (Series object).

**df['col'].nunique()** outputs the total number of unique values in a column.

**df['col'].value_counts()** outputs a Series containing the unique values as indices and the counts for each unique value in the column.

In [3]:
df['col2'].unique()

array([444, 555, 666])

In [4]:
df['col3'].nunique()

4

In [5]:
df['col2'].value_counts()

444    2
555    1
666    1
Name: col2, dtype: int64
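The result of `value_counts()` is itself a Series indexed by the unique values, so individual counts can be looked up by value. As a small sketch, the `normalize=True` option returns proportions instead of raw counts (the names `counts` and `shares` are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

counts = df['col2'].value_counts()                # raw counts per unique value
shares = df['col2'].value_counts(normalize=True)  # fraction of rows per unique value

print(counts.loc[444])  # how many rows hold the value 444
print(shares.loc[444])  # what fraction of rows hold the value 444
```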

**Get column and index names:**

In [6]:
df.columns

Index(['col1', 'col2', 'col3'], dtype='object')

In [7]:
df.index

RangeIndex(start=0, stop=4, step=1)

### Applying a function to a column of data 

**df['col'].apply(function)** applies a function to each value in a column and outputs a Series with the values returned from the function.



In [8]:
df

   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz


In [9]:
def times2(x):
    return x*2

df['col1'].apply(times2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

In [10]:
df['col3'].apply(len) # applying the built-in function len

0    3
1    3
2    3
3    3
Name: col3, dtype: int64

### Lambda Expressions  

Lambda expressions can be used in place of a named function wherever a function is expected as an input. Lambda expressions have the following format: 
**lambda x: [operation on x]**

**lambda** is the keyword that signifies a lambda expression.

**x** is the input. There can be multiple inputs, e.g. 'x1, x2, x3'.

**[operation on x]** is the output of the expression. This is where the intended operations are performed.

We are going to use a lambda expression in place of the function times2 to achieve the same output.

In [11]:
df['col1'].apply(lambda x: x*2)

0    2
1    4
2    6
3    8
Name: col1, dtype: int64
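A lambda expression can also take more than one input. The sketch below first calls a two-input lambda directly, then passes one to `Series.combine`, which calls the function on each pair of values from two Series. Using `combine` here is an illustrative choice, not something the lecture relies on, and the names `add` and `combined` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# A lambda with two inputs, called like any other function
add = lambda x1, x2: x1 + x2
print(add(2, 3))

# combine() calls the lambda on each pair of values from col1 and col2
combined = df['col1'].combine(df['col2'], lambda x1, x2: x1 + x2)
print(combined)
```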

This next example returns True if the value is greater than one standard deviation above the mean value, and False otherwise. In this example we use a single-line if-statement, which has the following format:

true_output **if** condition **else** false_output

In [16]:
df['col1'].apply(lambda x: True if x > (df['col1'].mean() + df['col1'].std()) else False)

0    False
1    False
2    False
3     True
Name: col1, dtype: bool
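The same boolean result can be obtained without `apply` by comparing the whole column against the threshold at once. This is a sketch of that alternative; the names `threshold` and `is_high` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

# Compute the threshold once instead of once per row
threshold = df['col1'].mean() + df['col1'].std()

# Comparing a Series with a number yields a boolean Series
is_high = df['col1'] > threshold
print(is_high)
```

Computing the mean and standard deviation a single time also avoids recalculating them for every value, which the lambda version does.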

Alternatively, you can perform operations with constants directly.

In [14]:
df['col1']*2

0    2
1    4
2    6
3    8
Name: col1, dtype: int64

#### Other Math operations 
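A Series provides methods such as sum(), min(), max(), and std(). Logarithms are not Series methods; apply NumPy functions such as np.log10() and np.log() to the Series instead.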

In [12]:
df['col1'].min() 

1
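As a short sketch of the other operations: aggregations such as `sum()` and `std()` are called on the column itself, while logarithms come from NumPy (imported here as `np`, which the lecture does not do) and are applied to the whole column. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444],
                   'col3': ['abc', 'def', 'ghi', 'xyz']})

total = df['col1'].sum()    # aggregation: a single number
largest = df['col1'].max()  # aggregation: a single number

# NumPy functions work element by element and return a Series
log10_vals = np.log10(df['col1'])
natural_log_vals = np.log(df['col1'])
print(log10_vals)
```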

### Using an Index

The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (the index works like a hash table or dictionary).
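A minimal sketch of dictionary-style lookups on a labeled Series (the Series `ser` and the other names below are illustrative):

```python
import pandas as pd

ser = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Uruguay'])

# Look up a single value by its label, much like a dictionary key
germany_value = ser['Germany']

# Look up several labels at once; the result is a smaller Series
subset = ser.loc[['USA', 'Uruguay']]

# Check whether a label exists in the index
has_ussr = 'USSR' in ser.index
has_england = 'England' in ser.index

print(germany_value)
print(subset)
```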

Operations on a Series are performed based on the index. As an example, let's perform addition with two Series objects. Let us create two Series, ser1 and ser2:

In [13]:
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Uruguay'])  
ser2 = pd.Series([1,2,5,4],index = ['Uruguay', 'Germany','England', 'USA'])

In [26]:
ser1

USA        1
Germany    2
USSR       3
Uruguay    4
dtype: int64

In [27]:
ser2

Uruguay    1
Germany    2
England    5
USA        4
dtype: int64

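Operations between Series are aligned on the index labels rather than on position. A label that appears in only one of the two Series (here England and USSR) yields NaN, and the result is converted to float so that it can hold NaN: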

In [28]:
ser1 + ser2

England    NaN
Germany    4.0
USA        5.0
USSR       NaN
Uruguay    5.0
dtype: float64
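If NaN is not the desired result for labels present in only one Series, the `add` method accepts a `fill_value` that stands in for the missing side. The sketch below assumes that treating a missing entry as 0 is acceptable, which depends on what the data means; the name `filled_sum` is illustrative.

```python
import pandas as pd

ser1 = pd.Series([1, 2, 3, 4], index=['USA', 'Germany', 'USSR', 'Uruguay'])
ser2 = pd.Series([1, 2, 5, 4], index=['Uruguay', 'Germany', 'England', 'USA'])

# Labels found in only one Series are treated as 0 on the other side
filled_sum = ser1.add(ser2, fill_value=0)
print(filled_sum)
```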

# Great Job!