# Pandas

##### Day 03 of 100 Days Of ML

*********************************************

#### 1. What is pandas? 

> pandas is a full-featured Python library for data analysis, manipulation, and visualization

#### 2. Read a tabular data file into pandas

"Tabular data" is just data that has been formatted as a table, with rows and columns (like a spreadsheet).

In [1]:
import pandas as pd
data = pd.read_table('http://bit.ly/chiporders')
data.head()


In [None]:
help(pd.read_table)

**Note: As the help output shows, the `sep` (separator) parameter of `read_table` defaults to a tab character (`'\t'`).**

In [None]:
mvs = pd.read_table('http://bit.ly/movieusers')
mvs.head()

> The movie users file is not tab-separated, so the default separator puts each whole line into a single column. The fields are separated by `|`, so pass `sep='|'` instead.

In [None]:
mvs = pd.read_table('http://bit.ly/movieusers', sep='|')
mvs.head()

> One more problem: the file has no header row, so pandas uses the first data row as column names. Pass `header=None` and supply our own names with `names`.

In [None]:
header_cols = ['user_id', 'age', 'gender', 'occupation', 'zip_code']
mvs = pd.read_table('http://bit.ly/movieusers', sep='|', header=None, names=header_cols)
mvs.head()
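
The same three arguments can be tried without a network connection. The following is a minimal sketch that assumes a small made-up sample in the same shape as the movie users file (pipe-delimited, no header row); the sample rows are illustrative only.

```python
import io
import pandas as pd

# Made-up sample: pipe-delimited, no header row (illustrative data only)
raw = "1|24|M|technician|85711\n2|53|F|other|94043\n3|23|M|writer|32067\n"

users = pd.read_table(
    io.StringIO(raw),   # any file-like object works in place of a URL
    sep="|",            # override the default tab separator
    header=None,        # the first line is data, not column names
    names=["user_id", "age", "gender", "occupation", "zip_code"],
)
print(users.shape)
print(list(users.columns))
```

Without `header=None`, the first data row would be consumed as the header and the frame would have one row fewer.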

#### 3. Select a pandas Series from a DataFrame

DataFrames and Series are the two main object types in pandas for data storage: a DataFrame is like a table, and each column of the table is called a Series. You will often select a Series in order to analyze or manipulate it. 

In [None]:
ufo = pd.read_csv('http://bit.ly/uforeports') # Same as pd.read_table('http://bit.ly/uforeports', sep=',')
type(ufo)

In [None]:
ufo.head()

In [None]:
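# Select a Series from the DataFrame with bracket notation or dot notation:
ufo['City'].head() # Same as ufo.City.head()

> **Note**: dot notation does not always work. For example, `ufo.Colors Reported` is a syntax error because the column name contains a space; use `ufo['Colors Reported']` instead.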

In [None]:
# Add a new column to the DataFrame (the new column must be assigned with bracket notation)
ufo['Location'] = ufo.City + ', ' + ufo.State
ufo.head()
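
The difference between the two notations can be sketched on a tiny hand-built DataFrame. The rows below are made up to resemble the UFO reports data, and the name `ufo_demo` is hypothetical.

```python
import pandas as pd

# Made-up rows shaped like the UFO reports data (illustrative only)
ufo_demo = pd.DataFrame({
    "City": ["Ithaca", "Willingboro"],
    "Colors Reported": ["RED", None],
    "State": ["NY", "NJ"],
})

colors = ufo_demo["Colors Reported"]  # bracket notation handles the space
cities = ufo_demo.City                # dot notation works for simple names

# A new column has to be created with bracket notation
ufo_demo["Location"] = ufo_demo["City"] + ", " + ufo_demo["State"]
print(ufo_demo["Location"].tolist())
```

Bracket notation is the safer default: it also works when a column name clashes with an existing DataFrame attribute or method such as `shape` or `head`.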

#### 4. How to rename columns in pandas

In [None]:
import pandas as pd
ufo = pd.read_csv('http://bit.ly/uforeports')
ufo.head()

In [None]:
ufo.columns

- First way: rename specific columns with `rename`

In [None]:
ufo.rename(columns={'Colors Reported' : 'Color_Reported', 'Shape Reported' : 'Shape_Reported'}, inplace=True)

In [None]:
ufo.columns

- Second way: overwrite all column names at once

In [None]:
new_cols =  ['City', 'Color_Reported', 'Shape_Reported', 'State', 'Time']
ufo.columns = new_cols

In [None]:
ufo.head()

- Third way: replace the header while reading the file

In [None]:
ufo = pd.read_csv('http://bit.ly/uforeports', names=new_cols, header=0) # header=0 replaces the existing header row; header=None would keep it as a data row

In [None]:
ufo.head()

- Fourth way: replace spaces with underscores in every column name

In [None]:
ufo = pd.read_table('http://bit.ly/uforeports', sep=',')
ufo.columns = ufo.columns.str.replace(' ', '_')

In [2]:
ufo.head()
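
As a compact recap of the renaming approaches, here is a minimal sketch on an empty DataFrame that only carries the UFO column names. The variable names `demo`, `renamed`, and `cleaned` are hypothetical.

```python
import pandas as pd

# Empty frame with the same column names as the UFO reports data
demo = pd.DataFrame(columns=["City", "Colors Reported", "Shape Reported", "State", "Time"])

# rename() returns a new DataFrame unless inplace=True is passed
renamed = demo.rename(columns={"Colors Reported": "Colors_Reported"})

# String methods on the column index fix every name in one pass
cleaned = demo.copy()
cleaned.columns = cleaned.columns.str.replace(" ", "_")
print(list(cleaned.columns))
```

Note that `demo` itself keeps its original names, because neither operation was applied to it in place.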


#### 5. How to remove columns

In [None]:
ufo = pd.read_csv('http://bit.ly/uforeports')

In [None]:
ufo.shape

In [None]:
ufo.head(2)

In [None]:
ufo.drop('Colors Reported', axis=1, inplace=True) # axis=1 drops a column;
                                                  # inplace=True modifies the original DataFrame

In [None]:
ufo.head(2)

In [None]:
ufo.shape

In [None]:
ufo.drop(['State', 'Time'], axis=1, inplace=True) 

In [None]:
ufo.shape

In [None]:
ufo.drop([0, 1], axis=0, inplace=True) # axis=0 drops rows, here the rows labeled 0 and 1

In [None]:
ufo.shape
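
`drop` also accepts the `columns=` and `index=` keywords, which avoid having to remember the `axis` numbers, and without `inplace=True` it returns a new DataFrame. A minimal sketch on made-up data (the name `demo` is hypothetical):

```python
import pandas as pd

# Made-up three-row frame (illustrative only)
demo = pd.DataFrame({
    "City": ["A", "B", "C"],
    "State": ["NY", "NJ", "CO"],
    "Time": [1, 2, 3],
})

no_time = demo.drop(columns=["Time"])  # equivalent to demo.drop("Time", axis=1)
no_first = demo.drop(index=[0])        # equivalent to demo.drop(0, axis=0)

# demo is unchanged because inplace=True was not passed
print(demo.shape, no_time.shape, no_first.shape)
```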

#### 6. Sort pandas DataFrame or Series

In [None]:
import pandas as pd
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()

In [None]:
movies.title.sort_values().head() # Same as movies['title'].sort_values().head()

In [None]:
movies.title.sort_values(ascending=False).head() # Descending sort

> **Note**: `sort_values` returns a new sorted Series; it does not change the underlying data.

In [None]:
# The title column of the original movies DataFrame is unchanged after sorting
movies.title.head()
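
The cells above sort a single Series. A whole DataFrame can be sorted by one or more columns with the same method. The sketch below uses a made-up three-row frame named `films` (hypothetical data, not the IMDb file).

```python
import pandas as pd

# Made-up rows (illustrative only)
films = pd.DataFrame({
    "title": ["B movie", "A movie", "C movie"],
    "genre": ["Drama", "Crime", "Drama"],
    "duration": [120, 95, 150],
})

# Sort the whole DataFrame by one column, longest first
by_duration = films.sort_values("duration", ascending=False)

# Sort by genre, then by duration within each genre
by_genre_then_duration = films.sort_values(["genre", "duration"])

print(by_duration["title"].tolist())
print(films["title"].tolist())  # original row order is unchanged
```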

#### 7. Filter rows by column value

In [None]:
import pandas as pd 
movies = pd.read_csv('http://bit.ly/imdbratings')
movies.head()

How do we select all the rows with a duration longer than 200 minutes?

##### - Long manual way to do it

In [None]:
booleans = []
for item in movies.duration:
    booleans.append(item > 200) # the comparison already yields True or False
is_long = pd.Series(booleans)
is_long.head()

In [None]:
movies[is_long].head()

##### - Short way (pandas way) to do it

In [None]:
movies[movies.duration > 200].head()
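
The short form works because the comparison itself produces a boolean Series, which pandas then uses as a row mask. A minimal sketch on made-up data (the `films` frame is hypothetical), also showing how `.loc` can filter rows and pick a column in one step:

```python
import pandas as pd

# Made-up rows (illustrative only)
films = pd.DataFrame({
    "title": ["Short", "Long", "Longer"],
    "genre": ["Drama", "Drama", "Western"],
    "duration": [95, 210, 240],
})

mask = films["duration"] > 200          # boolean Series, one value per row
long_films = films[mask]                # keep only rows where the mask is True
long_titles = films.loc[mask, "title"]  # rows by mask, one column by label

print(mask.tolist())
print(long_titles.tolist())
```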

#### 8. Filter by multiple criteria

In [None]:
movies[(movies.duration > 200) & (movies.genre == "Drama")]
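
Conditions are combined with `&` (and) and `|` (or), and each condition needs its own parentheses because these operators bind more tightly than comparisons. The following sketch, on a made-up `films` frame, shows the "or" case and `isin` as a shorthand for several equality checks on one column.

```python
import pandas as pd

# Made-up rows (illustrative only)
films = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "genre": ["Drama", "Crime", "Western", "Drama"],
    "duration": [95, 110, 240, 210],
})

# Rows that are long OR belong to the Crime genre
either = films[(films["duration"] > 200) | (films["genre"] == "Crime")]

# isin replaces a chain of == comparisons joined with |
in_genres = films[films["genre"].isin(["Crime", "Western"])]

print(either["title"].tolist())
print(in_genres["title"].tolist())
```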

#### 9. Read csv file with subset of columns

In [None]:
movies = pd.read_csv('http://bit.ly/imdbratings', usecols=['genre', 'duration'])

In [None]:
movies.columns
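
`usecols` accepts either column names or integer positions. A minimal offline sketch, assuming a made-up CSV string with a header similar to the ratings file (the rows are illustrative only):

```python
import io
import pandas as pd

# Made-up CSV text (illustrative only)
raw = "star_rating,title,genre,duration\n9.3,Film A,Crime,142\n9.2,Film B,Drama,175\n"

by_name = pd.read_csv(io.StringIO(raw), usecols=["genre", "duration"])
by_position = pd.read_csv(io.StringIO(raw), usecols=[0, 3])

print(list(by_name.columns))
print(list(by_position.columns))
```

Reading only the columns you need can reduce memory use on large files.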

#### 10. Iterate through a Series or DataFrame

In [None]:
for i in movies.genre: 
    print(i)

for index, row in movies.iterrows():
    print(index, row.genre, row.duration)
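
As an alternative to `iterrows`, `itertuples` yields one lightweight named tuple per row and is generally the faster of the two. The sketch below uses a made-up two-row frame; the names `films`, `rows`, and `total` are hypothetical.

```python
import pandas as pd

# Made-up rows (illustrative only)
films = pd.DataFrame({"genre": ["Crime", "Drama"], "duration": [142, 175]})

rows = []
for row in films.itertuples():
    # row.Index is the row label; columns are available as attributes
    rows.append((row.Index, row.genre, row.duration))
print(rows)

# For aggregates, a vectorized method avoids the loop entirely
total = films["duration"].sum()
print(total)
```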

#### 11. Select columns by data type

In [None]:
# read a dataset of alcohol consumption into a DataFrame, and check the data types
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.dtypes

In [None]:
import numpy as np
drinks.select_dtypes(include=np.number).dtypes
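
`select_dtypes` also takes an `exclude` argument and accepts the string `'number'` in place of `np.number`. A minimal sketch on a made-up frame (the values are illustrative, not taken from the drinks dataset):

```python
import pandas as pd

# Made-up rows (illustrative only)
demo = pd.DataFrame({
    "country": ["Albania", "Chile"],
    "beer_servings": [89, 130],
    "total_litres": [4.9, 7.6],
})

numeric = demo.select_dtypes(include="number")       # integer and float columns
non_numeric = demo.select_dtypes(exclude="number")   # everything else
floats_only = demo.select_dtypes(include=["float64"])

print(list(numeric.columns))
print(list(non_numeric.columns))
print(list(floats_only.columns))
```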

#### 12. How to use axis parameter in pandas

When **referring to rows or columns** with the axis parameter
- **axis 0** refers to rows
- **axis 1** refers to columns
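
The two bullets above can be checked on a small frame. This is a sketch with made-up numbers (the name `demo` is hypothetical). It also shows that for an aggregation such as `mean`, `axis=0` collapses the rows and returns one value per column, while `axis=1` collapses the columns and returns one value per row.

```python
import pandas as pd

# Made-up numbers (illustrative only)
demo = pd.DataFrame({"beer": [10, 20], "wine": [30, 40]}, index=["a", "b"])

# Referring to rows or columns
without_wine = demo.drop("wine", axis=1)  # axis=1: "wine" is a column label
without_a = demo.drop("a", axis=0)        # axis=0: "a" is a row label

# Aggregating along an axis
col_means = demo.mean(axis=0)  # one value per column
row_means = demo.mean(axis=1)  # one value per row

print(col_means.to_dict())
print(row_means.to_dict())
```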

In [2]:
import pandas as pd
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.head()


#### 13. Change the data type of a pandas Series