# Dataframe Basics

- The DataFrame is the core data structure of Pandas.

- A DataFrame represents tabular data as labeled rows and columns.

In [1]:
# import pandas
import pandas as pd

- Create a dataframe from a **CSV** file.

- This loads data from a CSV file into a Pandas Dataframe.

- **`pd.read_csv(PATH_TO_CSV_FILE)`** returns the dataframe.

In [33]:
df = pd.read_csv('data/weather_data.csv')
print(type(df))
df

<class 'pandas.core.frame.DataFrame'>


,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
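To follow along without the `data/weather_data.csv` file, the sketch below loads the same kind of data from an in-memory string. It relies on `pd.read_csv` accepting a file-like object as well as a path; the names `csv_text` and `sample_df` are illustrative and not part of the original notebook.

```python
import io

import pandas as pd

# Inline stand-in for the first rows of the weather CSV file.
csv_text = """day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
"""

# read_csv accepts a path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)          # (3, 4)
print(list(sample_df.columns))  # ['day', 'temperature', 'windspeed', 'event']
```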


### `shape` of the dataframe
- **`df.shape`** returns the dimensions of the dataframe as a `(rows, columns)` tuple.

In [4]:
df_shape = df.shape
print(df_shape)

(6, 4)


In [5]:
rows, cols = df.shape
print(rows)
print(cols)

6
4


### `head()` and `tail()`

- **`df.head(n)`** returns a dataframe with the first **n** rows of `df`.

    - If `n` is not given, it defaults to `5`.

- **`df.tail(n)`** returns a dataframe with the last **n** rows of `df`.

    - If `n` is not given, it defaults to `5`.

In [6]:
df.head()

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain


In [7]:
df.head(3)

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow


In [8]:
df.tail()

,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [9]:
df.tail(3)

,day,temperature,windspeed,event
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
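Two edge cases of `head()` and `tail()` are worth a quick sketch: a negative `n` drops rows from the opposite end, and an `n` larger than the dataframe returns every row. The dataframe is rebuilt inline here, as a stand-in for the CSV file, so the block runs on its own.

```python
import pandas as pd

# Inline copy of the weather data used throughout this notebook.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# Negative n: head(-2) returns everything except the last 2 rows.
all_but_last_two = df.head(-2)
print(all_but_last_two.index.tolist())   # [0, 1, 2, 3]

# Negative n: tail(-2) returns everything except the first 2 rows.
all_but_first_two = df.tail(-2)
print(all_but_first_two.index.tolist())  # [2, 3, 4, 5]

# n larger than the number of rows simply returns all rows.
everything = df.head(10)
print(len(everything))                   # 6
```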


### Indexing and slicing

In [11]:
df[2:5]

,day,temperature,windspeed,event
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
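The slice above selects rows by position, so the `stop` position is excluded, as in a regular Python list slice. The sketch below checks that `df[2:5]` matches `df.iloc[2:5]`; the dataframe is rebuilt inline so the block is self-contained.

```python
import pandas as pd

# Inline copy of the weather data used throughout this notebook.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# An integer slice inside [] selects rows by position: 2, 3 and 4 (5 is excluded).
sliced = df[2:5]
print(sliced.index.tolist())         # [2, 3, 4]

# The same rows come back from the explicit positional indexer.
print(sliced.equals(df.iloc[2:5]))   # True
```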


### Grabbing Columns

In [13]:
# get all column names
df.columns

Index(['day', 'temperature', 'windspeed', 'event'], dtype='object')

In [15]:
# grab a single column
df['temperature']

0    32
1    35
2    28
3    24
4    32
5    31
Name: temperature, dtype: int64

In [16]:
# also valid
df.temperature

0    32
1    35
2    28
3    24
4    32
5    31
Name: temperature, dtype: int64
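Attribute access is convenient, but it only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute. The sketch below uses a small hypothetical dataframe (`demo`, with made-up column names `wind speed` and `count`) to show where bracket access is needed.

```python
import pandas as pd

# Hypothetical dataframe: one column name has a space, another shadows a method.
demo = pd.DataFrame({
    'temperature': [32, 35],
    'wind speed': [6, 7],
    'count': [1, 2],
})

# A valid identifier works both ways.
same_column = demo.temperature.equals(demo['temperature'])
print(same_column)                 # True

# A name containing a space can only be reached with brackets.
wind = demo['wind speed']
print(wind.tolist())               # [6, 7]

# 'count' is also a DataFrame method, so demo.count is the method, not the column.
print(callable(demo.count))        # True
print(demo['count'].tolist())      # [1, 2]
```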

In [17]:
# column is of type 'Series'
type(df['windspeed'])

pandas.core.series.Series

In [18]:
# grabbing multiple columns -> pass a list -> returns a new dataframe
df[['temperature', 'windspeed']]

,temperature,windspeed
0,32,6
1,35,7
2,28,2
3,24,7
4,32,4
5,31,2


In [20]:
# also returns a dataframe
df[['windspeed']]

,windspeed
0,6
1,7
2,2
3,7
4,4
5,2


### Operations on a dataframe

In [21]:
# max temperature
temperature_col = df['temperature']
temperature_col.max()

35

In [22]:
# min
temperature_col.min()

24

In [23]:
# mean
temperature_col.mean()

30.333333333333332

In [24]:
# standard deviation
temperature_col.std()

3.8297084310253524

In [26]:
# df.describe() -> returns summary statistics for all numeric columns
df.describe()

,temperature,windspeed
count,6.0,6.0
mean,30.333333,4.666667
std,3.829708,2.33809
min,24.0,2.0
25%,28.75,2.5
50%,31.5,5.0
75%,32.0,6.75
max,35.0,7.0
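The `std` values above are sample standard deviations: pandas divides by `n - 1` by default (`ddof=1`). A minimal sketch of the difference from the population standard deviation (`ddof=0`), using the temperature values from this notebook; the variable names are illustrative.

```python
import pandas as pd

# Temperature values from the weather data.
temps = pd.Series([32, 35, 28, 24, 32, 31])

# Default: sample standard deviation (divides by n - 1).
sample_std = temps.std()

# Population standard deviation (divides by n).
population_std = temps.std(ddof=0)

print(sample_std)                   # matches the value reported by describe()
print(population_std < sample_std)  # True
```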


### Conditionally selecting data

- Inside `[]`, pass a comparison expression; the rows where it evaluates to `True` are returned.

In [28]:
# windspeed above average windspeed
avg_windspeed = df['windspeed'].mean()
df[df['windspeed'] > avg_windspeed] # returns a new dataframe

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
3,1/4/2017,24,7,Snow
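Several conditions can be combined with `&` (and) and `|` (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A minimal sketch, with the dataframe rebuilt inline and illustrative thresholds:

```python
import pandas as pd

# Inline copy of the weather data used throughout this notebook.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# Both conditions must hold: windy AND warm.
windy_and_warm = df[(df['windspeed'] > 4) & (df['temperature'] >= 32)]
print(windy_and_warm.index.tolist())   # [0, 1]

# Either condition may hold: rain OR snow.
rain_or_snow = df[(df['event'] == 'Rain') | (df['event'] == 'Snow')]
print(rain_or_snow.index.tolist())     # [0, 2, 3, 4]
```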


### Setting Index

- By default, a dataframe gets an integer index: `0, 1, 2, ...`.

In [29]:
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [30]:
df.index

RangeIndex(start=0, stop=6, step=1)

- The index can be changed using the `df.set_index(index_col)` method.

- The index can be reset to the default `RangeIndex` using `df.reset_index()`, which also moves the current index back into a regular column.

In [34]:
# setting day column as the index
df.set_index('day', inplace = True)
df

day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
1/4/2017,24,7,Snow
1/5/2017,32,4,Rain
1/6/2017,31,2,Sunny


In [35]:
df.reset_index(inplace = True)
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
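The cells above modify `df` in place. Without `inplace=True`, both methods leave the original untouched and return a new dataframe, which is often easier to reason about. A minimal sketch, with the dataframe rebuilt inline and illustrative variable names:

```python
import pandas as pd

# Inline copy of the weather data used throughout this notebook.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# set_index returns a new dataframe; df itself keeps its RangeIndex.
indexed = df.set_index('day')
print(indexed.index.name)      # day
print(list(df.columns))        # ['day', 'temperature', 'windspeed', 'event']

# reset_index moves 'day' back into a regular column.
restored = indexed.reset_index()
print(list(restored.columns))  # ['day', 'temperature', 'windspeed', 'event']
```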


### `loc` and `iloc`

- `df.loc` is used to grab a subsection of the dataframe, usually by using the **labels of the index**.

- Various ways of using `loc` are demonstrated below:

In [36]:
df.index

RangeIndex(start=0, stop=6, step=1)

In [38]:
# single label of index
df.loc[4]
# 4 -> treated as an index label, not as a positional (0-based) offset.

day            1/5/2017
temperature          32
windspeed             4
event              Rain
Name: 4, dtype: object

In [40]:
# list of labels -> returns a dataframe
df.loc[[1,2,4]]

,day,temperature,windspeed,event
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
4,1/5/2017,32,4,Rain


In [41]:
df.loc[[2]]

,day,temperature,windspeed,event
2,1/3/2017,28,2,Snow


In [42]:
# with a different index
df.set_index('day', inplace = True)
df

day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
1/4/2017,24,7,Snow
1/5/2017,32,4,Rain
1/6/2017,31,2,Sunny


In [46]:
# label-based slicing -> the 'stop' label is also included
df['1/1/2017':'1/4/2017']

day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/2/2017,35,7,Sunny
1/3/2017,28,2,Snow
1/4/2017,24,7,Snow


In [48]:
df.loc[['1/1/2017', '1/6/2017']]

day,temperature,windspeed,event
1/1/2017,32,6,Rain
1/6/2017,31,2,Sunny
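`loc` also accepts a column selector after the row selector, as in `df.loc[rows, columns]`. The sketch below picks a single cell and a labeled sub-table; the dataframe is rebuilt inline and `indexed` is an illustrative name.

```python
import pandas as pd

# Inline copy of the weather data, indexed by day.
indexed = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
}).set_index('day')

# Single cell: row label, then column label.
temp_on_jan_2 = indexed.loc['1/2/2017', 'temperature']
print(temp_on_jan_2)   # 35

# Label slice for rows (stop label included) plus a list of columns.
subset = indexed.loc['1/1/2017':'1/3/2017', ['temperature', 'event']]
print(subset.shape)    # (3, 2)
```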


- `df.iloc` is used for **integer-based** (positional) indexing.

- Positions are **0-based**.

- Accepts an integer, a list of integers, or a slice.

In [49]:
# reset index
df.reset_index(inplace = True)

In [50]:
# iloc -> single index
df.iloc[4]

day            1/5/2017
temperature          32
windspeed             4
event              Rain
Name: 4, dtype: object

In [52]:
df.iloc[4].windspeed

4

In [51]:
# list of indices
df.iloc[[2,4,5]]

,day,temperature,windspeed,event
2,1/3/2017,28,2,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny
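Slices behave differently in the two indexers: `iloc` excludes the stop position, while `loc` includes the stop label. The sketch below shows both on the default `RangeIndex`, along with a `[row, column]` lookup and a negative position; the dataframe is rebuilt inline.

```python
import pandas as pd

# Inline copy of the weather data used throughout this notebook.
df = pd.DataFrame({
    'day': ['1/1/2017', '1/2/2017', '1/3/2017', '1/4/2017', '1/5/2017', '1/6/2017'],
    'temperature': [32, 35, 28, 24, 32, 31],
    'windspeed': [6, 7, 2, 7, 4, 2],
    'event': ['Rain', 'Sunny', 'Snow', 'Snow', 'Rain', 'Sunny'],
})

# iloc slice: positions 1 and 2 (stop position 3 is excluded).
by_position = df.iloc[1:3]
print(by_position.index.tolist())   # [1, 2]

# loc slice: labels 1, 2 and 3 (stop label 3 is included).
by_label = df.loc[1:3]
print(by_label.index.tolist())      # [1, 2, 3]

# Row position 0, column position 1 -> first temperature value.
first_temp = df.iloc[0, 1]
print(first_temp)                   # 32

# Negative positions count from the end, like Python lists.
last_row = df.iloc[-1]
print(last_row['event'])            # Sunny
```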


### Different ways of creating a Dataframe

In [53]:
# from a CSV file
df = pd.read_csv('data/weather_data.csv')
df

,day,temperature,windspeed,event
0,1/1/2017,32,6,Rain
1,1/2/2017,35,7,Sunny
2,1/3/2017,28,2,Snow
3,1/4/2017,24,7,Snow
4,1/5/2017,32,4,Rain
5,1/6/2017,31,2,Sunny


In [54]:
# from an excel file
df = pd.read_excel('data/weather_data.xlsx')
df

,day,temperature,windspeed,event
0,2017-01-01,32,6,Rain
1,2017-01-02,35,7,Sunny
2,2017-01-03,28,2,Snow


In [55]:
# from an excel file, with a custom sheet name
df = pd.read_excel('data/weather_data.xlsx', sheet_name='Sheet1')
df

,day,temperature,windspeed,event
0,2017-01-01,32,6,Rain
1,2017-01-02,35,7,Sunny
2,2017-01-03,28,2,Snow


In [56]:
# from a dictionary
# key: col name
# value: list
df = pd.DataFrame({
    'day': ['1/4/2017', '1/5/2017'],
    'temperature': [24, 32],
    'windspeed': [6, 7],
    'event': ['Rain', 'Sunny']
})
df

,day,temperature,windspeed,event
0,1/4/2017,24,6,Rain
1,1/5/2017,32,7,Sunny
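Two more row-oriented options, sketched with the same two rows as the dictionary example: a list of dictionaries (one per row) and a list of tuples with explicit column names. The variable names are illustrative.

```python
import pandas as pd

# From a list of dictionaries: each dictionary is one row.
records = [
    {'day': '1/4/2017', 'temperature': 24, 'windspeed': 6, 'event': 'Rain'},
    {'day': '1/5/2017', 'temperature': 32, 'windspeed': 7, 'event': 'Sunny'},
]
df_from_records = pd.DataFrame(records)

# From a list of tuples: column names are passed separately.
rows = [
    ('1/4/2017', 24, 6, 'Rain'),
    ('1/5/2017', 32, 7, 'Sunny'),
]
df_from_tuples = pd.DataFrame(rows, columns=['day', 'temperature', 'windspeed', 'event'])

print(df_from_records.shape)          # (2, 4)
print(list(df_from_tuples.columns))   # ['day', 'temperature', 'windspeed', 'event']
```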


- Refer to [this article](https://towardsdatascience.com/15-ways-to-create-a-pandas-dataframe-754ecc082c17) for more ways.