# Welcome to WQD7003 Data Analytics Lab
This code is generated for the purpose of WQD7003 module.

Created by Shier Nee Saw

Reference: Python for Data Analysis (O'Reilly)

# Getting Started with pandas

To get started with pandas, you will need to get comfortable with its two workhorse data structures: Series and DataFrame. While they are not a universal solution for every problem, they provide a solid, easy-to-use basis for most applications.

## Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its index. The simplest Series is formed from only an array of data:

In [None]:
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj

0    4
1    7
2   -5
3    3
dtype: int64

The string representation of a Series displayed interactively shows the index on the left and the values on the right.

Since we did not specify an index for the data, a default one consisting of the integers 0 through N - 1 (where N is the length of the data) is created.

In [None]:
obj.values

array([ 4,  7, -5,  3])

In [None]:
obj.index

RangeIndex(start=0, stop=4, step=1)

In [None]:
# Often it will be desirable to create a Series with an index identifying each data point:

obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2

d    4
b    7
a   -5
c    3
dtype: int64

In [None]:
obj2.index

Index(['d', 'b', 'a', 'c'], dtype='object')

In [None]:
# Compared with a regular NumPy array, you can use values in the index when selecting
# single values or a set of values:

obj2['a']

-5

In [None]:
# assign value 6 to element with index d
obj2['d'] = 6
obj2

d    6
b    7
a   -5
c    3
dtype: int64

In [None]:
# access a list of elements in the Series

obj2[['c', 'a', 'd']]

c    3
a   -5
d    6
dtype: int64

In [None]:
# NumPy array operations, such as filtering with a boolean array, scalar multiplication,
# or applying math functions, will preserve the index-value link

obj2[obj2 > 0]

d    6
b    7
c    3
dtype: int64

In [None]:
obj2 * 2

d    12
b    14
a   -10
c     6
dtype: int64

In [None]:
import numpy as np

np.exp(obj2)

d     403.428793
b    1096.633158
a       0.006738
c      20.085537
dtype: float64

In [None]:
# If you have data in a Python dict, you can create a Series from it by passing the dict.
# Each key (written before the colon) becomes an index label and each value becomes the data.

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}
obj3 = pd.Series(sdata)
obj3

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

In [None]:
# When only a dict is passed, the index of the resulting Series follows the dict's key order.
# Passing an explicit index selects and orders the labels; labels not found in the dict get NaN.

states = ['California', 'Ohio', 'Oregon', 'Texas']
obj4 = pd.Series(sdata, index=states)
obj4

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

In this case, the three values found in sdata were placed in the appropriate locations. Since no value for 'California' was found, it appears as NaN (not a number), which pandas uses to mark missing or NA values.


In [None]:
# To check for NaN values

pd.isnull(obj4)

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
obj4.isnull()

California     True
Ohio          False
Oregon        False
Texas         False
dtype: bool

In [None]:
# To check for not NaN values

pd.notnull(obj4)

California    False
Ohio           True
Oregon         True
Texas          True
dtype: bool

In [None]:
# Now we perform addition between obj3 and obj4

print(obj3)
print()
print(obj4)

print()
print(obj3 + obj4)

Ohio      35000
Texas     71000
Oregon    16000
Utah       5000
dtype: int64

California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
dtype: float64

California         NaN
Ohio           70000.0
Oregon         32000.0
Texas         142000.0
Utah               NaN
dtype: float64


Notice that the index of the resulting Series is the union of the two indexes.

Only labels that exist in both Series are summed. If a label is missing from either Series, or its value is NaN, the result for that label is NaN, as with California and Utah in this case.
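
The behaviour described above can be checked directly. The following is a small sketch that rebuilds the same two Series under the hypothetical names `left` and `right` (so the notebook's `obj3` and `obj4` are not overwritten) and compares the labels that appear in only one index with the labels that end up as NaN.

```python
import pandas as pd

sdata = {'Ohio': 35000, 'Texas': 71000, 'Oregon': 16000, 'Utah': 5000}

# Hypothetical names, chosen so the notebook's obj3 / obj4 stay untouched.
left = pd.Series(sdata)
right = pd.Series(sdata, index=['California', 'Ohio', 'Oregon', 'Texas'])

total = left + right

# Labels present in exactly one of the two indexes.
only_in_one = left.index.symmetric_difference(right.index)
print(list(only_in_one))

# Labels whose sum is NaN in the result.
nan_labels = total[total.isna()].index
print(list(nan_labels))
```

In this example both lists contain the same labels, because every NaN in the result comes from a label that one of the two Series lacks.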

In [None]:
# Both the Series object itself and its index have a name attribute, which integrates with
# other key areas of pandas functionality:

obj4.name = 'population'
obj4.index.name = 'state'

obj4

state
California        NaN
Ohio          35000.0
Oregon        16000.0
Texas         71000.0
Name: population, dtype: float64

In [None]:
# A Series’s index can be altered in place by assignment:

print(obj)

print()
print('After assignment')
obj.index = ['Bob', 'Steve', 'Jeff', 'Ryan']
print(obj)

0    4
1    7
2   -5
3    3
dtype: int64

After assignment
Bob      4
Steve    7
Jeff    -5
Ryan     3
dtype: int64


## DataFrame

A DataFrame represents a tabular, spreadsheet-like data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.).

The DataFrame has both a row and column index; it can be thought of as a dict of Series, all sharing the same index. Compared with other DataFrame-like structures you may have used before (like R’s data.frame), row-oriented and column-oriented operations in DataFrame are treated roughly symmetrically.
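
To make the "dict of Series" view concrete, here is a minimal sketch. The names `demo_year`, `demo_pop` and `demo_frame` are hypothetical and the data is made up for illustration.

```python
import pandas as pd

# Two Series that share the same index labels (illustrative data).
demo_year = pd.Series([2000, 2001, 2002], index=['one', 'two', 'three'])
demo_pop = pd.Series([1.5, 1.7, 3.6], index=['one', 'two', 'three'])

# Each dict key becomes a column; the shared index becomes the row index.
demo_frame = pd.DataFrame({'year': demo_year, 'pop': demo_pop})

# Selecting a column hands back a Series that carries the same row index.
col = demo_frame['pop']
print(type(col).__name__)
print(list(col.index))
```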

Under the hood, the data is stored as one or more two-dimensional blocks rather than a list, dict, or some other collection of one-dimensional arrays.

In [None]:
# One of the most common ways to construct a DataFrame is from a dict of equal-length lists or NumPy arrays

data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
        'year': [2000, 2001, 2002, 2001, 2002],
        'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}

frame = pd.DataFrame(data)
frame

    state  year  pop
0    Ohio  2000  1.5
1    Ohio  2001  1.7
2    Ohio  2002  3.6
3  Nevada  2001  2.4
4  Nevada  2002  2.9


In [None]:
# If you specify a sequence of columns, the DataFrame’s columns will be exactly what
# you pass:

pd.DataFrame(data, columns=['year', 'state', 'pop'])

   year   state  pop
0  2000    Ohio  1.5
1  2001    Ohio  1.7
2  2002    Ohio  3.6
3  2001  Nevada  2.4
4  2002  Nevada  2.9


In [None]:
# As with Series, if you pass a column that isn’t contained in data, it will appear with NA
# values in the result:

frame2 = pd.DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
                      index=['one', 'two', 'three', 'four', 'five'])

frame2

       year   state  pop debt
one    2000    Ohio  1.5  NaN
two    2001    Ohio  1.7  NaN
three  2002    Ohio  3.6  NaN
four   2001  Nevada  2.4  NaN
five   2002  Nevada  2.9  NaN


In [None]:
# A column in a DataFrame can be retrieved as a Series either by dict-like notation or by
# attribute:

frame2['state']

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [None]:
frame2.state

one        Ohio
two        Ohio
three      Ohio
four     Nevada
five     Nevada
Name: state, dtype: object

In [None]:
frame2.year

one      2000
two      2001
three    2002
four     2001
five     2002
Name: year, dtype: int64

In [None]:
# Columns can be modified by assignment. For example, the empty 'debt' column could
# be assigned a scalar value or an array of values:

frame2['debt'] = 16.5

frame2

       year   state  pop  debt
one    2000    Ohio  1.5  16.5
two    2001    Ohio  1.7  16.5
three  2002    Ohio  3.6  16.5
four   2001  Nevada  2.4  16.5
five   2002  Nevada  2.9  16.5


In [None]:
import numpy as np

frame2['debt'] = np.arange(5.)

frame2

       year   state  pop  debt
one    2000    Ohio  1.5   0.0
two    2001    Ohio  1.7   1.0
three  2002    Ohio  3.6   2.0
four   2001  Nevada  2.4   3.0
five   2002  Nevada  2.9   4.0


**Note**: When assigning lists or arrays to a column, the value’s length must match the length of the DataFrame.
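
As an illustration of the note above, this sketch assigns a list that is too short to a small, hypothetical DataFrame named `demo` and catches the resulting error. The exact wording of the error message can differ between pandas versions, so it is printed rather than assumed.

```python
import pandas as pd

# Hypothetical three-row DataFrame used only for this illustration.
demo = pd.DataFrame({'x': [1, 2, 3]})

raised = False
try:
    # Only two values for three rows: the lengths do not match.
    demo['y'] = [10, 20]
except ValueError as err:
    raised = True
    print('ValueError:', err)

print(raised)
```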

If you assign a Series, it will instead be conformed exactly to the DataFrame’s index, inserting missing values in any holes:

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=['two', 'four', 'five'])

frame2['debt'] = val

frame2

       year   state  pop  debt
one    2000    Ohio  1.5   NaN
two    2001    Ohio  1.7  -1.2
three  2002    Ohio  3.6   NaN
four   2001  Nevada  2.4  -1.5
five   2002  Nevada  2.9  -1.7


In [None]:
# Assigning a column that doesn’t exist will create a new column

frame2['eastern'] = frame2.state == 'Ohio'

frame2

       year   state  pop  debt  eastern
one    2000    Ohio  1.5   NaN     True
two    2001    Ohio  1.7  -1.2     True
three  2002    Ohio  3.6   NaN     True
four   2001  Nevada  2.4  -1.5    False
five   2002  Nevada  2.9  -1.7    False


In [None]:
# del keyword will delete columns as with a dict:

del frame2['eastern']
frame2.columns

Index(['year', 'state', 'pop', 'debt'], dtype='object')

In [None]:
# Another common form of data is a nested dict of dicts format:
# If passed to DataFrame, it will interpret the outer dict keys as the columns and the inner
# keys as the row indices:

pop = {'Nevada': {2001: 2.4, 2002: 2.9},
       'Ohio': {2000: 1.5, 2001: 1.7, 2002: 3.6}}

frame3 = pd.DataFrame(pop)

frame3

      Nevada  Ohio
2001     2.4   1.7
2002     2.9   3.6
2000     NaN   1.5


In [None]:
# Of course you can always transpose the result:

frame3.T

        2001  2002  2000
Nevada   2.4   2.9   NaN
Ohio     1.7   3.6   1.5


In [None]:
# Like Series, the values attribute returns the data contained in the DataFrame as a 2D
# ndarray:

frame3.values

array([[2.4, 1.7],
       [2.9, 3.6],
       [nan, 1.5]])

In [None]:
# If the DataFrame’s columns are different dtypes, the dtype of the values array will be
# chosen to accommodate all of the columns:

frame2.values

array([[2000, 'Ohio', 1.5, nan],
       [2001, 'Ohio', 1.7, -1.2],
       [2002, 'Ohio', 3.6, nan],
       [2001, 'Nevada', 2.4, -1.5],
       [2002, 'Nevada', 2.9, -1.7]], dtype=object)

### Index Objects

pandas’s Index objects are responsible for holding the axis labels and other metadata (like the axis name or names).

Any array or other sequence of labels used when constructing a Series or DataFrame is internally converted to an Index:

In [None]:
obj = pd.Series(range(3), index=['a', 'b', 'c'])

index = obj.index

index

Index(['a', 'b', 'c'], dtype='object')

In [None]:
# Index objects are immutable and thus can’t be modified by the user:
index[1] = 'd'

TypeError: Index does not support mutable operations

Immutability is important so that Index objects can be safely shared among data structures.
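
A short sketch of what sharing looks like in practice: two Series are built on the same Index, and "changing" a label on one of them produces a new object instead of modifying the shared labels. The names `labels`, `first`, `second` and `renamed` are hypothetical.

```python
import pandas as pd

# One Index object used as the labels of two different Series.
labels = pd.Index(['a', 'b', 'c'])
first = pd.Series([1, 2, 3], index=labels)
second = pd.Series([10.0, 20.0, 30.0], index=labels)

# rename returns a new Series with a new index; nothing is edited in place.
renamed = first.rename(index={'b': 'd'})

print(list(renamed.index))
print(list(first.index))
print(list(second.index))
```

Because the original labels cannot be edited in place, `first` and `second` keep their labels after the rename.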

### Dropping entries from an axis

Dropping one or more entries from an axis is easy if you have an index array or list without those entries.

As that can require a bit of munging and set logic, the drop method will return a new object with the indicated value or values deleted from an axis:

In [None]:
obj = pd.Series(np.arange(5.), index=['a', 'b', 'c', 'd', 'e'])
obj

In [None]:
new_obj = obj.drop('c')
new_obj

In [None]:
obj.drop(['d', 'c'])

In [None]:
# With DataFrame, index values can be deleted from either axis:
data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                   index=['Ohio', 'Colorado', 'Utah', 'New York'],
                   columns=['one', 'two', 'three', 'four'])

data.drop(['Colorado', 'Ohio'])

In [None]:
data.drop('two', axis=1)

In [None]:
data.drop(['two', 'four'], axis=1)

### Indexing, selection and filtering

Series indexing (obj[...]) works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers:

In [None]:
obj = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])
obj

In [None]:
# return the values with index 'b'
obj['b']

In [None]:
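# return the second value by position. Python positions start at zero.
# .iloc is used because plain obj[1] on a label index is deprecated in recent pandas versions.
obj.iloc[1]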

In [None]:
# return the third to fourth values
obj[2:4]

In [None]:
# return the values with index ['b', 'a', 'd']
obj[['b', 'a', 'd']]

In [None]:
# return the second and fourth values by position. Python positions start at zero.
obj.iloc[[1, 3]]

In [None]:
# return the values that are less than 2
obj[obj < 2]

Try changing the index values in the cells above and see what happens.
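
When experimenting, it can help to say explicitly whether a label or a position is meant. The sketch below uses the pandas `.loc` (label-based) and `.iloc` (position-based) accessors, which are not covered in the cells above, on a hypothetical Series named `demo`.

```python
import numpy as np
import pandas as pd

# Same shape of data as the Series used above, under a hypothetical name.
demo = pd.Series(np.arange(4.), index=['a', 'b', 'c', 'd'])

by_label = demo.loc['b']         # select by index label
by_position = demo.iloc[1]       # select by integer position

label_slice = demo.loc['b':'c']  # label slice: endpoint included
pos_slice = demo.iloc[1:2]       # position slice: endpoint excluded

print(by_label, by_position)
print(list(label_slice.index))
print(list(pos_slice.index))
```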

In [None]:
# Slicing with labels behaves differently than normal Python slicing in that the endpoint
# is inclusive:

obj['b':'c']

In [None]:
# Setting using these methods works just as you would expect

obj['b':'c'] = 5
obj

In [None]:
# As you’ve seen above, indexing into a DataFrame is for retrieving one or more columns
# either with a single value or sequence:

data = pd.DataFrame(np.arange(16).reshape((4, 4)),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

data

In [None]:
data['two']

In [None]:
data[['three', 'one']]

In [None]:
# Selecting rows by slicing

data[:2]

In [None]:
# Selecting rows by boolean
# Return value when column three is larger than 5

data[data['three'] > 5]

In [None]:
# Return a boolean DataFrame marking where the values are less than 5

data < 5

In [None]:
# Return the values that are less than 5 (all other positions become NaN)

data[data < 5]

In [None]:
# assign value 0 for value less than 5

data[data < 5] = 0

data

### Arithmetic and data alignment

One of the most important pandas features is the behavior of arithmetic between objects with different indexes.

When adding together objects, if any index pairs are not the same, the respective index in the result will be the union of the index pairs.

In [None]:
s1 = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
s2 = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

In [None]:
s1

In [None]:
s2

In [None]:
s1 + s2

The internal data alignment introduces NA values in the indices that don’t overlap.

Missing values propagate in arithmetic computations.
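
The two statements above can be verified with a small sketch. It rebuilds the same data under the hypothetical names `a` and `b`, lists the labels that become NaN, and shows that a further computation on the result keeps them as NaN.

```python
import pandas as pd

# Same data as s1 and s2 above, under hypothetical names.
a = pd.Series([7.3, -2.5, 3.4, 1.5], index=['a', 'c', 'd', 'e'])
b = pd.Series([-2.1, 3.6, -1.5, 4, 3.1], index=['a', 'c', 'e', 'f', 'g'])

total = a + b

# Labels found in only one of the two Series become NaN.
missing = list(total[total.isna()].index)
print(missing)

# Further arithmetic keeps those entries as NaN.
doubled = total * 2
still_missing = list(doubled[doubled.isna()].index)
print(still_missing)
```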

#### Arithmetic methods with fill values

In arithmetic operations between differently-indexed objects, you might want to fill with a special value, like 0, when an axis label is found in one object but not the other.

Arithmetic methods:
*  add
*  sub
*  div
*  mul
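
As a quick sketch of the four methods listed above, the example below applies each one to two small, hypothetical Series (`left` and `right`) whose indexes only partly overlap. The choice of `fill_value` (0 for add and sub, 1 for mul and div) is an assumption made for this illustration: the fill value stands in for the side that is missing a label.

```python
import pandas as pd

# Hypothetical Series: only label 'b' appears in both.
left = pd.Series([10, 20], index=['a', 'b'])
right = pd.Series([1, 2], index=['b', 'c'])

added = left.add(right, fill_value=0)       # a: 10+0, b: 20+1, c: 0+2
subbed = left.sub(right, fill_value=0)      # a: 10-0, b: 20-1, c: 0-2
multiplied = left.mul(right, fill_value=1)  # a: 10*1, b: 20*1, c: 1*2
divided = left.div(right, fill_value=1)     # a: 10/1, b: 20/1, c: 1/2

print(added.tolist())
print(subbed.tolist())
print(multiplied.tolist())
print(divided.tolist())
```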

In [None]:
df1 = pd.DataFrame(np.arange(12.).reshape((3, 4)), columns=list('abcd'))
df2 = pd.DataFrame(np.arange(20.).reshape((4, 5)), columns=list('abcde'))

In [None]:
df1

In [None]:
df2

In [None]:
df1 + df2

Adding these together results in NA values in the locations that don’t overlap.

In [None]:
# Using the add method on df1, I pass df2 and an argument to fill_value:

df1.add(df2, fill_value=0)

In [None]:
df1.sub(df2, fill_value=0)

# Exercise
1. Create a pandas Series from this dict: {'a': 10, 'b': 20, 'c': 30}.
2. Retrieve the value at index label 'b' from Q1 Series.
3. Add the following two Series: Series1 = {'a': 10, 'b': 20, 'c': 30} and Series2 = {'b': 30, 'c': 40, 'd': 50}. What happens to the missing values?
4. Create a DataFrame from a dictionary where keys are column names and values are lists of data: {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']}.
5. Retrieve the 'Age' column from the DataFrame created in Q4.
6. Add a new column 'Gender' with values ['Female', 'Male', 'Male'] to the DataFrame created in Q4.
7. Filter the DataFrame created in Q4 to select rows where 'Age' is greater than 28.
8. Sort the DataFrame created in Q4 by the 'Age' column in ascending order.
9. Group the DataFrame created in Q4 by the 'City' column and calculate the average age in each city.


In [None]:
# Your solution here

### Submission: File > Print > As PDF > Submit in ODL Platform
### Make sure the answer is visible in PDF format.
### Deadline: 1 week after today's class.