## Pandas overview
Pandas is an open-source Python library built on top of NumPy.  
It offers data structures and operations for manipulating numerical data and time series.  
It is popular mainly because it makes importing and analyzing data much easier.  
Pandas is fast and offers high performance and productivity for its users.  

Install pandas with pip:

    pip install pandas

In [3]:
import pandas as pd
pd.__version__

'2.0.3'
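Because pandas is built on NumPy, a pandas object can be handed to NumPy functions and converted to a NumPy array. The following is a minimal sketch of that relationship; the sample values and the names `s`, `arr`, and `roots` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, chosen only for illustration.
s = pd.Series([1.0, 2.25, 4.0])

# to_numpy() returns the values as a NumPy array.
arr = s.to_numpy()
print(type(arr))

# NumPy functions work element-wise on a Series and keep its index.
roots = np.sqrt(s)
print(roots)
```

The result of `np.sqrt(s)` is still a Series, so the labels stay attached to the computed values.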

# Pandas data structures  
- Series
- DataFrame

## Series
- A one-dimensional labeled array capable of holding data of any type
- The axis labels are collectively called the index

In [41]:
obj1 = pd.Series([1, 2, 3, 4])
obj1

0    1
1    2
2    3
3    4
dtype: int64

In [42]:
obj2 = pd.Series([1, 'Trinh', True, 2], index=['a', 'b', 'c', 'd'])
print(obj2)
print()
print(obj2.index)
print()
print(obj2.values)
print()
print(obj2['c'])

a        1
b    Trinh
c     True
d        2
dtype: object

Index(['a', 'b', 'c', 'd'], dtype='object')

[1 'Trinh' True 2]

True


In [43]:
obj3 = pd.Series({'Trinh': 10, 'Cong': 20, 'Thanh': 30, 'Tien': 40})
print(obj3)
print()
print(obj3.index)
print()
print(obj3.values)
print()
print(obj3[obj3 < 25])
print()
print(obj3 * 2)

Trinh    10
Cong     20
Thanh    30
Tien     40
dtype: int64

Index(['Trinh', 'Cong', 'Thanh', 'Tien'], dtype='object')

[10 20 30 40]

Trinh    10
Cong     20
dtype: int64

Trinh    20
Cong     40
Thanh    60
Tien     80
dtype: int64
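Arithmetic between two Series matches values by index label rather than by position. The sketch below shows what happens when the labels only partly overlap; the label `'Lan'` and all values are hypothetical and do not come from the cells above.

```python
import pandas as pd

# Hypothetical scores for two rounds. 'Tien' is missing from round2
# and 'Lan' is missing from round1.
round1 = pd.Series({'Trinh': 10, 'Cong': 20, 'Tien': 40})
round2 = pd.Series({'Trinh': 1, 'Cong': 2, 'Lan': 3})

# Values are added label by label; labels present on one side only give NaN.
total = round1 + round2
print(total)

# add() with fill_value=0 treats a missing label as 0 instead.
filled = round1.add(round2, fill_value=0)
print(filled)
```

Using `fill_value` is one possible choice; whether a missing entry should count as zero depends on what the data means.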


In [44]:
obj4 = pd.Series({'Trinh': 10, 'Cong': 20, 'Thanh': 30, 'Tien': 40})

print('Trinh' in obj4)
print()
print(20 in obj4)
print()
print(20 in obj4.values)

True

False

True
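The output above shows that `in` tests the index labels of a Series, not its values. Below is a small sketch of a few ways to test the values instead; the names `mask` and `has_20` are illustrative.

```python
import pandas as pd

obj = pd.Series({'Trinh': 10, 'Cong': 20, 'Thanh': 30, 'Tien': 40})

# `in` checks the index labels, much like the keys of a dict.
print('Cong' in obj)

# isin() tests every value and returns a boolean Series usable as a mask.
mask = obj.isin([20, 40])
print(obj[mask])

# A comparison followed by any() asks whether at least one value matches.
has_20 = (obj == 20).any()
print(has_20)
```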


## DataFrame
- A two-dimensional, size-mutable data structure
- Data is aligned in a tabular fashion in rows and columns
- A pandas DataFrame consists of three principal components: the data, the rows, and the columns
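The three components listed above can be inspected directly on any DataFrame. This is a minimal sketch using a hypothetical two-row frame named `small`.

```python
import pandas as pd

# A tiny frame with made-up rows, only to show the components.
small = pd.DataFrame(
    {"name": ["Bill", "Tom"], "score": [90, 80]},
    index=["one", "two"],
)

print(small.index)       # the row labels
print(small.columns)     # the column labels
print(small.to_numpy())  # the data as a NumPy array
print(small.shape)       # (number of rows, number of columns)
```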

In [45]:
import pandas as pd

In [85]:
data = {"name": ["Bill", "Tom", "Tim", "John", "Alex", "Vanessa", "Kate"],
        "score": [90, 80, 85, 75, 95, 60, 65],
        "sport": ["Wrestling", "Football", "Skiing", "Swimming", "Tennis",
                  "Karete", "Surfing"],
        "sex": ["M", "M", "M", "M", "F", "F", "F"]}

In [86]:
df = pd.DataFrame(data)
df

      name  score      sport sex
0     Bill     90  Wrestling   M
1      Tom     80   Football   M
2      Tim     85     Skiing   M
3     John     75   Swimming   M
4     Alex     95     Tennis   F
5  Vanessa     60     Karete   F
6     Kate     65    Surfing   F


In [87]:
# Specify the index labels and the column order
df = pd.DataFrame(data, index=['one', 'two', 'three', 'four', 'five', 'six', 'seven'],
                  columns=['name', 'sex', 'sport', 'score'])
df

          name sex      sport  score
one       Bill   M  Wrestling     90
two        Tom   M   Football     80
three      Tim   M     Skiing     85
four      John   M   Swimming     75
five      Alex   F     Tennis     95
six    Vanessa   F     Karete     60
seven     Kate   F    Surfing     65


In [88]:
# head() and tail() return the first and last rows (5 by default)
print(df.head())
print()
print(df.tail())
print()
print(df.head(2))
print()
print(df.tail(3))


       name sex      sport  score
one    Bill   M  Wrestling     90
two     Tom   M   Football     80
three   Tim   M     Skiing     85
four   John   M   Swimming     75
five   Alex   F     Tennis     95

          name sex     sport  score
three      Tim   M    Skiing     85
four      John   M  Swimming     75
five      Alex   F    Tennis     95
six    Vanessa   F    Karete     60
seven     Kate   F   Surfing     65

     name sex      sport  score
one  Bill   M  Wrestling     90
two   Tom   M   Football     80

          name sex    sport  score
five      Alex   F   Tennis     95
six    Vanessa   F   Karete     60
seven     Kate   F  Surfing     65


In [89]:
# Add new columns to existing data: columns missing from data are filled with NaN
df = pd.DataFrame(data, columns=["name", "sport", "gender", "score", "age"],
                  index=["one", "two", "three", "four", "five", "six", "seven"])
df

          name      sport gender  score  age
one       Bill  Wrestling    NaN     90  NaN
two        Tom   Football    NaN     80  NaN
three      Tim     Skiing    NaN     85  NaN
four      John   Swimming    NaN     75  NaN
five      Alex     Tennis    NaN     95  NaN
six    Vanessa     Karete    NaN     60  NaN
seven     Kate    Surfing    NaN     65  NaN
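The `gender` column above is empty because `data` has a `sex` key but no `gender` key, so pandas has nothing to fill it with. If the goal is to keep the existing values under a new name, renaming is one option. The sketch below uses a shortened, hypothetical version of `data`, and the names `requested` and `renamed` are illustrative.

```python
import pandas as pd

# A shortened, hypothetical version of the data dictionary.
data = {"name": ["Bill", "Alex"], "sex": ["M", "F"], "score": [90, 95]}

# Asking for 'gender' gives an all-NaN column, since data has no such key.
requested = pd.DataFrame(data, columns=["name", "gender", "score"])
print(requested["gender"].isna().all())

# rename() keeps the existing 'sex' values under the new column name.
renamed = pd.DataFrame(data).rename(columns={"sex": "gender"})
print(renamed)
```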


In [90]:
# Select data by column names and index labels
columns = ['name', 'sport']
print(df[columns])
print()

index_labels = ['two', 'three']
print(df.loc[index_labels])
print()

print(df.loc['six', 'score'])
print()

print(df.loc[['two', 'four'], ['name', 'sport']])
print()

          name      sport
one       Bill  Wrestling
two        Tom   Football
three      Tim     Skiing
four      John   Swimming
five      Alex     Tennis
six    Vanessa     Karete
seven     Kate    Surfing

      name     sport gender  score  age
two    Tom  Football    NaN     80  NaN
three  Tim    Skiing    NaN     85  NaN

60

      name     sport
two    Tom  Football
four  John  Swimming
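`loc` selects by label, as in the cell above. Its positional counterpart is `iloc`, which takes integer positions. The following sketch compares the two on a shortened, hypothetical copy of the frame.

```python
import pandas as pd

# A shortened, hypothetical copy of the frame used above.
df = pd.DataFrame(
    {"name": ["Bill", "Tom", "Tim", "John"],
     "sport": ["Wrestling", "Football", "Skiing", "Swimming"],
     "score": [90, 80, 85, 75]},
    index=["one", "two", "three", "four"],
)

# The same selection by label and by integer position.
by_label = df.loc[["two", "four"], ["name", "sport"]]
by_position = df.iloc[[1, 3], [0, 1]]
print(by_label.equals(by_position))

# Label slices include the end label; positional slices exclude the end.
print(df.loc["one":"three", "name"])
print(df.iloc[0:3, 0])
```

Both slices at the end return the same three rows, even though `"three"` is named explicitly in the label slice and position `3` is excluded from the positional one.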



In [91]:
# Add a new column whose values come from a condition
print(df.score)
print()

df['pass'] = df.score >= 80
print(df)

one      90
two      80
three    85
four     75
five     95
six      60
seven    65
Name: score, dtype: int64

          name      sport gender  score  age   pass
one       Bill  Wrestling    NaN     90  NaN   True
two        Tom   Football    NaN     80  NaN   True
three      Tim     Skiing    NaN     85  NaN   True
four      John   Swimming    NaN     75  NaN  False
five      Alex     Tennis    NaN     95  NaN   True
six    Vanessa     Karete    NaN     60  NaN  False
seven     Kate    Surfing    NaN     65  NaN  False
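A boolean column like `pass` can also drive row filtering or be mapped to labels. Here is a minimal sketch on a shortened, hypothetical frame; the `result` column and its `"passed"`/`"failed"` labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# A shortened, hypothetical frame with the same kind of columns.
df = pd.DataFrame(
    {"name": ["Bill", "John", "Alex", "Vanessa"], "score": [90, 75, 95, 60]}
)

# The comparison yields one boolean per row.
df["pass"] = df["score"] >= 80

# np.where picks one of two values per row based on the boolean column.
df["result"] = np.where(df["pass"], "passed", "failed")

# A boolean column can be used directly as a row filter.
passed = df[df["pass"]]
print(passed)
```

Bracket access is used for `df["pass"]` because `pass` is a Python keyword, so the attribute form `df.pass` is a syntax error.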


In [92]:
df['age'] = [17, 18, 19, 20, 21, 22, 23]
df

          name      sport gender  score  age   pass
one       Bill  Wrestling    NaN     90   17   True
two        Tom   Football    NaN     80   18   True
three      Tim     Skiing    NaN     85   19   True
four      John   Swimming    NaN     75   20  False
five      Alex     Tennis    NaN     95   21   True
six    Vanessa     Karete    NaN     60   22  False
seven     Kate    Surfing    NaN     65   23  False
