In [14]:
import pandas as pd

## First use: reading CSV files

In [15]:
df = pd.read_csv('example.csv')  # returns a pd.DataFrame; CSV = comma-separated values

In [16]:
df

Unnamed: 0,id,name,year,GPA
0,101,John,2010,3.5
1,102,Mary,2011,3.0
2,103,Peter,2012,2.5
3,104,John,2013,3.0
4,105,Mary,2014,3.5
5,106,Peter,2015,2.5
6,107,John,2016,3.0
7,108,Mary,2013,3.5
8,109,Peter,2014,2.5
9,110,John,2015,3.0


In [21]:
# mean GPA; select the numeric column first, because 'name' holds strings
df['GPA'].to_numpy().mean()

3.0

pandas uses column-based indexing. Therefore, the default way of manipulating the data is by manipulating its columns.
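As a minimal sketch of what column-wise manipulation can look like, the snippet below rebuilds the first three rows of the table by hand (a stand-in for `example.csv`), derives a new column, and reduces a column to a single number. The `GPA_percent` name and the 4.0 grading scale are assumptions made only for illustration.

```python
import pandas as pd

# Hand-built stand-in for the first three rows of example.csv.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["John", "Mary", "Peter"],
    "year": [2010, 2011, 2012],
    "GPA": [3.5, 3.0, 2.5],
})

# Arithmetic on a column is applied element-wise to every row.
# GPA_percent is a hypothetical derived column (assumes a 4.0 scale).
df["GPA_percent"] = df["GPA"] / 4.0 * 100

# Reductions such as mean() also operate on a whole column.
mean_gpa = df["GPA"].mean()

print(df.columns.tolist())
print(mean_gpa)
```

Assigning to `df["new_name"]` adds the column in place, so later cells see it as part of `df.columns`.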

In [5]:
df.columns

Index(['id', 'name', 'year', 'GPA'], dtype='object')

In [6]:
# get a single column

# when retrieving a single column, it becomes a pd.Series
df['id']

0    101
1    102
2    103
3    104
4    105
5    106
6    107
7    108
8    109
9    110
Name: id, dtype: int64

In [6]:
# multiple columns
# when retrieving multiple columns, it returns a pd.DataFrame object
df[['name', 'GPA']]

Unnamed: 0,name,GPA
0,John,3.5
1,Mary,3.0
2,Peter,2.5
3,John,3.0
4,Mary,3.5
5,Peter,2.5
6,John,3.0
7,Mary,3.5
8,Peter,2.5
9,John,3.0
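The difference between the two selection styles comes down to the brackets: a single label returns a one-dimensional `pd.Series`, while a list of labels returns a two-dimensional `pd.DataFrame`, even when the list holds only one name. A small sketch on a hand-built three-row stand-in for the data:

```python
import pandas as pd

# Hand-built stand-in for the first three rows of example.csv.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["John", "Mary", "Peter"],
    "GPA": [3.5, 3.0, 2.5],
})

as_series = df["GPA"]     # single label -> pd.Series
as_frame = df[["GPA"]]    # list with one label -> pd.DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```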


## Basic Indexing

A pandas DataFrame can be thought of as a simple database table. With a database, we usually want to query for rows that match some condition. Let's see how to do that with a DataFrame.

In [10]:
# 1. retrieve all rows such that the GPA is >= 3.0
df[df['GPA'] >= 3.0]

Unnamed: 0,id,name,year,GPA
0,101,John,2010,3.5
1,102,Mary,2011,3.0
3,104,John,2013,3.0
4,105,Mary,2014,3.5
6,107,John,2016,3.0
7,108,Mary,2013,3.5
9,110,John,2015,3.0
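It may help to see what the expression inside the brackets actually is. The comparison produces a boolean `pd.Series` (a mask) with one entry per row, and indexing with that mask keeps the rows where it is `True`. The sketch below rebuilds the table by hand and also shows `DataFrame.query`, which expresses the same filter as a string; whether to prefer it is a matter of taste.

```python
import pandas as pd

# Hand-built copy of the table shown above (stand-in for example.csv).
df = pd.DataFrame({
    "id": list(range(101, 111)),
    "name": ["John", "Mary", "Peter"] * 3 + ["John"],
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2013, 2014, 2015],
    "GPA": [3.5, 3.0, 2.5, 3.0, 3.5, 2.5, 3.0, 3.5, 2.5, 3.0],
})

# The comparison yields one boolean per row.
mask = df["GPA"] >= 3.0
print(mask.tolist()[:3])          # [True, True, False]

# Indexing with the mask keeps only the rows marked True.
by_mask = df[mask]

# The same filter written as a query string.
by_query = df.query("GPA >= 3.0")
print(by_mask.equals(by_query))   # True
```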


In [7]:
# multiple conditions - and
# 2. retrieve all rows such that the GPA is >= 3.0, the id is > 103, and the name is John
df[(df['GPA'] >= 3.0) & (df['id'] > 103) & (df['name'] == 'John')]

Unnamed: 0,id,name,year,GPA
3,104,John,2013,3.0
6,107,John,2016,3.0
9,110,John,2015,3.0


In [12]:
# multiple conditions - or
# 3. retrieve all rows such that the GPA is >= 3.0 or the id is > 103
df[(df['GPA'] >= 3.0) | (df['id'] > 103)]

Unnamed: 0,id,name,year,GPA
0,101,John,2010,3.5
1,102,Mary,2011,3.0
3,104,John,2013,3.0
4,105,Mary,2014,3.5
5,106,Peter,2015,2.5
6,107,John,2016,3.0
7,108,Mary,2013,3.5
8,109,Peter,2014,2.5
9,110,John,2015,3.0


In [11]:
# (year < 2015 and GPA == 3.5)   or name == Mary

df[((df['year'] < 2015) & (df['GPA'] == 3.5)) | (df['name'] == "Mary")]

Unnamed: 0,id,name,year,GPA
0,101,John,2010,3.5
1,102,Mary,2011,3.0
4,105,Mary,2014,3.5
7,108,Mary,2013,3.5
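The parentheses around each condition in the queries above are required, not stylistic. In Python, `&` and `|` bind more tightly than comparisons such as `>=`, so an unparenthesized expression is grouped differently from what was intended and fails. A sketch on a hand-built copy of the table:

```python
import pandas as pd

# Hand-built copy of the table shown above (stand-in for example.csv).
df = pd.DataFrame({
    "id": list(range(101, 111)),
    "name": ["John", "Mary", "Peter"] * 3 + ["John"],
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2013, 2014, 2015],
    "GPA": [3.5, 3.0, 2.5, 3.0, 3.5, 2.5, 3.0, 3.5, 2.5, 3.0],
})

# With parentheses, each comparison is evaluated first, then combined.
both = df[(df["GPA"] >= 3.0) & (df["id"] > 103)]
print(both["id"].tolist())   # [104, 105, 107, 108, 110]

# Without parentheses, Python groups the expression as
#   df["GPA"] >= (3.0 & df["id"]) > 103
# which does not describe the intended filter and raises an exception.
raised = False
try:
    df[df["GPA"] >= 3.0 & df["id"] > 103]
except (TypeError, ValueError) as exc:
    raised = True
    print("failed with", type(exc).__name__)
```

Note also that the plain Python keywords `and` and `or` do not work element-wise on a Series, which is why `&` and `|` are used instead.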


In [13]:
# Indexing by integer row position: rows 0 through 4 (the end of the slice is excluded)
df.iloc[0:5]

Unnamed: 0,id,name,year,GPA
0,101,John,2010,3.5
1,102,Mary,2011,3.0
2,103,Peter,2012,2.5
3,104,John,2013,3.0
4,105,Mary,2014,3.5
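`iloc` selects by position, whereas its sibling `loc` selects by index label. On a freshly loaded DataFrame the two look alike because the labels are 0, 1, 2, ..., but they differ in two ways: a `loc` slice includes its end label, and after filtering the labels no longer match the positions. A sketch on a hand-built copy of the table:

```python
import pandas as pd

# Hand-built copy of the table shown above (stand-in for example.csv).
df = pd.DataFrame({
    "id": list(range(101, 111)),
    "name": ["John", "Mary", "Peter"] * 3 + ["John"],
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2013, 2014, 2015],
    "GPA": [3.5, 3.0, 2.5, 3.0, 3.5, 2.5, 3.0, 3.5, 2.5, 3.0],
})

# Position-based slice: end excluded, so 5 rows.
print(len(df.iloc[0:5]))   # 5
# Label-based slice: end included, so 6 rows.
print(len(df.loc[0:5]))    # 6

# After filtering, the remaining index labels are 0, 1, 3, 4, 6, 7, 9.
high = df[df["GPA"] >= 3.0]
print(high.iloc[3:5]["id"].tolist())   # positions 3 and 4 -> [105, 107]
print(high.loc[3:5]["id"].tolist())    # labels 3 to 5     -> [104, 105]
```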


## Usage with numpy

In [22]:
df[['id', 'GPA']].to_numpy()

array([[101. ,   3.5],
       [102. ,   3. ],
       [103. ,   2.5],
       [104. ,   3. ],
       [105. ,   3.5],
       [106. ,   2.5],
       [107. ,   3. ],
       [108. ,   3.5],
       [109. ,   2.5],
       [110. ,   3. ]])
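Notice that the ids in the array above are printed as floats (`101.`). A NumPy array has a single dtype, so `to_numpy()` converts all selected columns to a common type: integers and floats together become `float64`, and mixing in a text column falls back to `object`. A short sketch on a hand-built three-row stand-in:

```python
import pandas as pd

# Hand-built stand-in for the first three rows of example.csv.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["John", "Mary", "Peter"],
    "GPA": [3.5, 3.0, 2.5],
})

ids = df["id"].to_numpy()
print(ids.dtype)        # int64

numeric = df[["id", "GPA"]].to_numpy()
print(numeric.dtype)    # float64 (ints are upcast to match the floats)

mixed = df[["id", "name"]].to_numpy()
print(mixed.dtype)      # object (no common numeric type)

# Column-wise mean over the numeric array: one value per column.
print(numeric.mean(axis=0))
```

For numerical work it is therefore usually safer to select only the numeric columns before converting.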

## Save to CSV

In [None]:
df.to_csv('example_output.csv', index=False)
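The `index=False` argument matters when the file is read back. By default `to_csv` also writes the row index as an extra, unnamed first column, which `read_csv` then loads as a column called `Unnamed: 0`. The sketch below demonstrates this with in-memory buffers (`io.StringIO`) instead of real files, purely so that it runs without touching the disk.

```python
import io
import pandas as pd

# Hand-built stand-in for the first three rows of example.csv.
df = pd.DataFrame({
    "id": [101, 102, 103],
    "name": ["John", "Mary", "Peter"],
    "GPA": [3.5, 3.0, 2.5],
})

# Round trip without the index: the columns come back unchanged.
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
restored = pd.read_csv(without_index)
print(restored.columns.tolist())   # ['id', 'name', 'GPA']

# Round trip with the default index=True: an extra column appears.
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
restored_with_index = pd.read_csv(with_index)
print(restored_with_index.columns.tolist())   # ['Unnamed: 0', 'id', 'name', 'GPA']
```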