# pandas Data Structures:
We have learned about **Series**; let's now learn about DataFrames (the 2<sup>nd</sup> workhorse of pandas), which build on what we know about Series.

* DataFrame
* Grab data (column wise)
* Grab data (row wise)
* Grabbing an element or a sub-set of the dataframe
* Adding new column
* Deleting the column
* boolean_mask
* boolean_mask(Combine 2 conditions)
* reset_index(), set_index(), head(), tail(), info(), describe()

## DataFrame
* A very simple way to think about a DataFrame is as "a bunch of Series put together so that they share the same index". <br> 
* A DataFrame is a rectangular table of data that contains an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc). A DataFrame has both a row and a column index; it can be thought of as a dictionary of Series all sharing the same index. <br>
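
To make the "dictionary of Series" picture concrete, here is a minimal sketch. The names `s1`, `s2` and `small_df` are made up for illustration and are not part of the data used in this notebook.

```python
import pandas as pd

# Two Series that share the same index labels
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series(['x', 'y', 'z'], index=['a', 'b', 'c'])

# A dict of Series becomes a DataFrame: the keys turn into column names
small_df = pd.DataFrame({'numbers': s1, 'letters': s2})

print(small_df)
print(small_df.dtypes)   # each column keeps its own value type
```

Each column of `small_df` is still a Series, and both columns are aligned on the shared index `a`, `b`, `c`.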

&#9758; *A good read for those who are interested: ([Python for Data Analysis](http://shop.oreilly.com/product/0636920023784.do))<br>*

Let's learn **DataFrame with examples:**<br> 

In [1]:
import pandas as pd
import numpy as np

Let's create two labels/indexes:
* for rows 'r1 to r10'
* for columns 'c1 to c10'

We will then use **`arange()`** and **`reshape()`** together to create a 2D array (matrix) for the data.<br>

In [2]:
index = 'r1 r2 r3 r4 r5 r6 r7 r8 r9 r10'.split()
columns = 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10'.split()

&#9989; *Use **TAB** for auto-complete and **shift + TAB**  for doc.*

In [3]:
# What do index, columns and array_2d look like?
index

['r1', 'r2', 'r3', 'r4', 'r5', 'r6', 'r7', 'r8', 'r9', 'r10']

In [6]:
columns

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [7]:
array_2d = np.arange(0,100).reshape(10,10)

In [8]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [9]:
# Let's create our first DataFrame using index, columns and array_2d
df = pd.DataFrame(data = array_2d, index = index, columns = columns)

In [10]:
# What does the DataFrame look like?
df  # select * from df      

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


**df** is our first dataframe. <br>
We have columns, c1 to c10, and their corresponding rows, r1 to r10. <br>
Each column is actually a pandas Series, sharing a common index (row labels). <br>

&#9758; Let's learn how to **grab the data** that we need; this is the most important thing to learn before we move on!<br>

### Columns 

In [14]:
# Grabbing a single column 
df['c2']  # select c2 from df
# df.c2
# The output looks like a Series, right?
# The returned Series also has the same index as the DataFrame

r1      1
r2     11
r3     21
r4     31
r5     41
r6     51
r7     61
r8     71
r9     81
r10    91
Name: c2, dtype: int32

In [12]:
type(df['c2']) # It is a pandas Series 

pandas.core.series.Series

In [15]:
# Grabbing more than one column, pass the list of columns you need! 
df[['c1', 'c10', 'c2']]

,c1,c10,c2
r1,0,9,1
r2,10,19,11
r3,20,29,21
r4,30,39,31
r5,40,49,41
r6,50,59,51
r7,60,69,61
r8,70,79,71
r9,80,89,81
r10,90,99,91


**df.column_name (e.g. df.c1, df.c2 etc)** can be used to grab a column as well. It's good to know, but I don't recommend it. <br> 
If you press "TAB" after df., you will see lots of available methods; using df.column_name makes it easy to confuse columns with these methods.<br>
**Let's try this once**

In [14]:
df.c5 #df['c5']

r1      4
r2     14
r3     24
r4     34
r5     44
r6     54
r7     64
r8     74
r9     84
r10    94
Name: c5, dtype: int32
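
Here is a small sketch of why bracket notation is the safer habit. The frame `demo` is hypothetical; one of its columns is deliberately named `count`, which is also the name of a DataFrame method.

```python
import pandas as pd

# Hypothetical frame: one column happens to be named like a DataFrame method
demo = pd.DataFrame({'count': [3, 4], 'c1': [10, 20]})

print(type(demo.c1))         # a Series, same as demo['c1']
print(callable(demo.count))  # True: this is the DataFrame.count method, not the column
print(demo['count'])         # bracket notation always returns the column
```

Attribute access works for `c1`, but for `count` the method wins and the column can only be reached with `demo['count']`.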

### Adding new column
Let's try it with the "+" operation!

In [16]:
df  

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [18]:
df['new'] = df['c1'] + df['c2']  # select *, (c1 + c2) as new from df
df['new2'] = 'A'  # a single value is repeated for every row
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,4,5,6,7,8,9,1,A
r2,10,11,12,13,14,15,16,17,18,19,21,A
r3,20,21,22,23,24,25,26,27,28,29,41,A
r4,30,31,32,33,34,35,36,37,38,39,61,A
r5,40,41,42,43,44,45,46,47,48,49,81,A
r6,50,51,52,53,54,55,56,57,58,59,101,A
r7,60,61,62,63,64,65,66,67,68,69,121,A
r8,70,71,72,73,74,75,76,77,78,79,141,A
r9,80,81,82,83,84,85,86,87,88,89,161,A
r10,90,91,92,93,94,95,96,97,98,99,181,A
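
The same idea on a smaller, made-up frame (`demo` is not part of the notebook data). This sketch also shows `assign()`, which returns a new DataFrame instead of changing the existing one.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

# Element-wise sum of two columns, stored as a new column (this changes demo)
demo['new'] = demo['c1'] + demo['c2']

# A single value is repeated for every row
demo['new2'] = 'A'

# assign() returns a new DataFrame and leaves demo unchanged
with_total = demo.assign(total=demo['c1'] + demo['c3'])

print(demo)
print(with_total)
```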


### Deleting the column -- `drop()`

        *df.drop('new')-- ValueError: labels ['new'] not contained in axis

Press Shift+TAB on `drop` and you will see that the default is axis = 0, which refers to the index (row labels); for a column, we need to specify axis = 1.<br>
&#9758; rows refer to axis 0 and columns refer to axis 1<br> 
&#9758; Quick Check: *df.shape gives a tuple (rows, cols), with rows at [0] and cols at [1]*
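
A quick sketch of the two axes on a made-up frame (`demo` and the result names are only for illustration). The `columns=` keyword of `drop()` is an alternative spelling of `axis=1`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

print(demo.shape)   # (2, 3): 2 rows on axis 0, 3 columns on axis 1

no_row = demo.drop('r1', axis=0)   # axis=0 looks for the label among the rows
no_col = demo.drop('c1', axis=1)   # axis=1 looks for the label among the columns

# Keyword form that avoids remembering the axis numbers
same_as_no_col = demo.drop(columns='c1')
```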

In [19]:
# We can delete a column using drop()
# df.drop('new')# ValueError: labels ['new'] not contained in axis
df.drop('new2', axis=1)

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new
r1,0,1,2,3,4,5,6,7,8,9,1
r2,10,11,12,13,14,15,16,17,18,19,21
r3,20,21,22,23,24,25,26,27,28,29,41
r4,30,31,32,33,34,35,36,37,38,39,61
r5,40,41,42,43,44,45,46,47,48,49,81
r6,50,51,52,53,54,55,56,57,58,59,101
r7,60,61,62,63,64,65,66,67,68,69,121
r8,70,71,72,73,74,75,76,77,78,79,141
r9,80,81,82,83,84,85,86,87,88,89,161
r10,90,91,92,93,94,95,96,97,98,99,181


&#9758; Is "new2" really deleted? <br>
Output df and you will see that "new2" is still there!<br>

In [20]:
df  # select * from df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,new,new2
r1,0,1,2,3,4,5,6,7,8,9,1,A
r2,10,11,12,13,14,15,16,17,18,19,21,A
r3,20,21,22,23,24,25,26,27,28,29,41,A
r4,30,31,32,33,34,35,36,37,38,39,61,A
r5,40,41,42,43,44,45,46,47,48,49,81,A
r6,50,51,52,53,54,55,56,57,58,59,101,A
r7,60,61,62,63,64,65,66,67,68,69,121,A
r8,70,71,72,73,74,75,76,77,78,79,141,A
r9,80,81,82,83,84,85,86,87,88,89,161,A
r10,90,91,92,93,94,95,96,97,98,99,181,A


To actually delete the column from df, you have to tell pandas by setting<br>
* ***inplace = True*** (default is inplace=False).<br>

&#9989; *pandas is cautious: it does not want us to lose information by mistake, so it only changes df itself when we ask for it with inplace*

In [23]:
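# Drop both helper columns for good
df.drop(['new', 'new2'], axis = 1, inplace = True)

&#9758; Running this cell a second time raises a `KeyError`, because the columns are already gone.<br>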
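
As a small sketch of the alternative to `inplace`, you can reassign the result of `drop()`. The frame `demo` is made up for illustration; `errors='ignore'` is an existing option of `drop()` that skips labels which are not found.

```python
import pandas as pd

demo = pd.DataFrame({'c1': [0, 10], 'new': [1, 21]})

dropped = demo.drop('new', axis=1)   # returns a new DataFrame
print('new' in demo.columns)         # True: demo itself is untouched

demo = demo.drop('new', axis=1)      # reassign to keep the result
print('new' in demo.columns)         # False

# errors='ignore' skips labels that are already gone instead of raising KeyError
demo = demo.drop('new', axis=1, errors='ignore')
```

With `errors='ignore'`, re-running the last line is harmless, which can be handy in a notebook where cells get executed more than once.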

### Rows
We can retrieve a row by its name or position with **[`loc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.loc.html)** and **[`iloc`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.iloc.html)**.<br>
**loc** -- Access a group of rows and columns by label(s)

In [24]:
# df['r1'] # KeyError: 'r1'
df.loc['r1'] # loc for location in square brackets
# we see that the rows are series as well!

c1     0
c2     1
c3     2
c4     3
c5     4
c6     5
c7     6
c8     7
c9     8
c10    9
Name: r1, dtype: int32

We can also use a row's integer position with **iloc**, even if our index is labeled.

In [22]:
df.iloc[4] # iloc[position], integer position based (counting from 0)

c1     40
c2     41
c3     42
c4     43
c5     44
c6     45
c7     46
c8     47
c9     48
c10    49
Name: r5, dtype: int32

In [34]:
# more than one row -- pass a list of rows!
df.loc[['r1','r2', 'r3']]

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
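
A short sketch comparing `loc` and `iloc` side by side, including slices. The frame `demo` is a small made-up example. Note one difference that often surprises beginners: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(12).reshape(4, 3),
                    index=['r1', 'r2', 'r3', 'r4'], columns=['c1', 'c2', 'c3'])

by_label = demo.loc['r2']       # row labelled 'r2'
by_position = demo.iloc[1]      # second row (positions start at 0)

label_slice = demo.loc['r1':'r3']   # label slices include the end label
position_slice = demo.iloc[0:3]     # position slices exclude the end position

print(label_slice)
print(position_slice)
```

Both slices return rows `r1`, `r2` and `r3`.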


### Grabbing an element or a sub-set of the dataframe

In [35]:
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [36]:
df.loc[['r2', 'r5']]

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r2,10,11,12,13,14,15,16,17,18,19
r5,40,41,42,43,44,45,46,47,48,49


In [37]:
# df.loc[req_row, req_col] -- pass row, col for the element!
df.loc['r1','c1']

0

In [38]:
# for a sub-set, pass the list
df.loc[['r1','r2'],['c1','c2']]

,c1,c2
r1,0,1
r2,10,11


In [39]:
# another example - random columns and rows in the list 
df.loc[['r2','r5'],['c3','c4']]

,c3,c4
r2,12,13
r5,42,43


In [40]:
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [41]:
# We can do a conditional selection as well
df > 5
# df != 0 
# df == 0

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,False,False,False,False,False,False,True,True,True,True
r2,True,True,True,True,True,True,True,True,True,True
r3,True,True,True,True,True,True,True,True,True,True
r4,True,True,True,True,True,True,True,True,True,True
r5,True,True,True,True,True,True,True,True,True,True
r6,True,True,True,True,True,True,True,True,True,True
r7,True,True,True,True,True,True,True,True,True,True
r8,True,True,True,True,True,True,True,True,True,True
r9,True,True,True,True,True,True,True,True,True,True
r10,True,True,True,True,True,True,True,True,True,True


In [42]:
df


,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


This is similar to a NumPy boolean mask. Let's try this:

    *bool_mask = df % 3 == 0
    *df[bool_mask]
It returns the values where the mask is True and NaN where it is False. 

In [44]:
# Keep the values divisible by 3 
bool_mask = df % 3 == 0
df[bool_mask]
# The same in one step:
# df[df % 3 == 0]

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0.0,,,3.0,,,6.0,,,9.0
r2,,,12.0,,,15.0,,,18.0,
r3,,21.0,,,24.0,,,27.0,,
r4,30.0,,,33.0,,,36.0,,,39.0
r5,,,42.0,,,45.0,,,48.0,
r6,,51.0,,,54.0,,,57.0,,
r7,60.0,,,63.0,,,66.0,,,69.0
r8,,,72.0,,,75.0,,,78.0,
r9,,81.0,,,84.0,,,87.0,,
r10,90.0,,,93.0,,,96.0,,,99.0
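
You may have noticed that the integers turned into floats (0.0, 3.0, ...) in the output. A minimal sketch of what is going on, using a small made-up frame `demo`: masking a whole DataFrame behaves like `where()`, and because `NaN` is a float, the affected columns become `float64`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(6).reshape(2, 3),
                    index=['r1', 'r2'], columns=['c1', 'c2', 'c3'])

mask = demo > 2
masked = demo[mask]        # same result as demo.where(mask)

print(masked)
print(masked.dtypes)       # float64, because NaN is a float

# Keep only the rows that contain no NaN at all
print(masked.dropna())
```

Here `dropna()` keeps only row `r2`, since every value in `r1` fails the condition.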


&#9758; It's not common to use such an operation on an entire dataframe. We usually apply it to columns or rows instead.<br>
**For example, we don't want a row with NaN values.**<br>
What to do?<br>
Let's have a look at one example.

In [119]:
# Our original dataframe is 
df  # select * from df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


Let's apply a condition on column c1, say `c1 > 11`.<br>
Based on the conditional selection, the output will be:

In [120]:
df['c1']>11  # a boolean Series, one True/False per row
#df[df['c1']>11]

r1     False
r2     False
r3      True
r4      True
r5      True
r6      True
r7      True
r8      True
r9      True
r10     True
Name: c1, dtype: bool

We don't want `r1` and `r2`, because the condition is False for them. <br>
Let's filter the rows based on the condition on the column values.

In [46]:
df[df['c1']>11] # df[boolean_mask]  # Select * from df where c1 > 11
# We will use such operation frequently in our course.

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9758; The above, **"`df[df['c1']>11]`"**, is a dataframe with the condition applied, and we can select any column from this dataframe.<br> For example:

In [122]:
result = df[df['c1']>11]
result['c1']

r3     20
r4     30
r5     40
r6     50
r7     60
r8     70
r9     80
r10    90
Name: c1, dtype: int32

We can do the above operations (filtering and selecting a column) in a single line by chaining the commands. 


In [50]:
df[df['c1']>11]['c6']
# This could be a little confusing for beginners, but don't worry, we will 
# use such operations frequently in the course as well, and you will find 
# them very handy. 

r3     25
r4     35
r5     45
r6     55
r7     65
r8     75
r9     85
r10    95
Name: c6, dtype: int32

In [51]:
# let's grab two columns, we need to pass the list ['c1','c9'] here
df[df['c1']>11][['c1','c9']]  #Select c1, c9 from df where c1 > 11

,c1,c9
r3,20,28
r4,30,38
r5,40,48
r6,50,58
r7,60,68
r8,70,78
r9,80,88
r10,90,98
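
As a side note, the same selection can be written in one step with `loc`, passing the row condition and the column list together. This is only a sketch; `demo` is rebuilt here with the same values as our df so that the block runs on its own.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(0, 100).reshape(10, 10),
                    index=[f'r{i}' for i in range(1, 11)],
                    columns=[f'c{i}' for i in range(1, 11)])

chained = demo[demo['c1'] > 11][['c1', 'c9']]        # filter, then select
one_step = demo.loc[demo['c1'] > 11, ['c1', 'c9']]   # rows and columns together

print(one_step.equals(chained))
```

For reading data both forms give the same result. The `loc` form is usually the one to prefer when you later want to assign values to the selection.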


In [53]:
# We can do this operation on rows using loc 
# Passing multiple rows in a list
df[df['c1']>11].loc[['r3','r5']]

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r3,20,21,22,23,24,25,26,27,28,29
r5,40,41,42,43,44,45,46,47,48,49


In [55]:
result = df['c1']==70 
result

r1     False
r2     False
r3     False
r4     False
r5     False
r6     False
r7     False
r8      True
r9     False
r10    False
Name: c1, dtype: bool

In [56]:
df[result]

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79


In [57]:
df[df['c1']==70]  #select * from df where c1 == 70

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r8,70,71,72,73,74,75,76,77,78,79


### Combine 2 conditions 
Let's try on c1 for a value > 60 and on c2 for a value > 80

In [58]:
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [59]:
df['c1']>60

r1     False
r2     False
r3     False
r4     False
r5     False
r6     False
r7     False
r8      True
r9      True
r10     True
Name: c1, dtype: bool

In [60]:
df['c2']>80

r1     False
r2     False
r3     False
r4     False
r5     False
r6     False
r7     False
r8     False
r9      True
r10     True
Name: c2, dtype: bool

In [132]:
(df['c1']>60) & (df['c2']>80)

r1     False
r2     False
r3     False
r4     False
r5     False
r6     False
r7     False
r8     False
r9      True
r10     True
dtype: bool

In [61]:
df[(df['c1']>60) & (df['c2']>80)]  # select * from df where c1>60 and c2>80
# notice that (df['c1']>60) and (df['c2']>80) are each wrapped in () for clear separation,
# and the combined condition goes inside df[]

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


&#9989;**NOTE:**<br>
The "and" operator will not work in the above condition; using "and" will raise <br>

        *ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

"Ambiguous" here means that "and" only works with a single boolean on each side, such as "True and False". It cannot decide the truth of a whole Series of True/False values, so it raises the error. We need to use "&" instead ("|" for or).<br>
Try the above code using "and". <br>
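
A small sketch of the element-wise operators on a made-up frame (`demo` and the result names are only for illustration), including the error you get with `and`:

```python
import pandas as pd

demo = pd.DataFrame({'c1': [60, 70, 80, 90], 'c2': [61, 71, 81, 91]})

both = (demo['c1'] > 60) & (demo['c2'] > 80)     # element-wise AND
either = (demo['c1'] > 80) | (demo['c2'] < 70)   # element-wise OR
negated = ~(demo['c1'] > 60)                     # element-wise NOT

try:
    (demo['c1'] > 60) and (demo['c2'] > 80)      # `and` needs a single True/False
except ValueError as err:
    print(err)
```

The parentheses around each condition matter, because `&` and `|` bind more tightly than comparisons such as `>`.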

### Let's have a quick look at a couple of useful methods.
***We will explore more later on in the course!***

**`reset_index()`** and **`set_index()`**<br>
We can reset the index of our dataframe to the default numerical index; pass `inplace = True` to make the change permanent. *The existing index becomes a new column.*

In [62]:
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [66]:
df.reset_index(inplace = True)

In [67]:
df

,index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
0,r1,0,1,2,3,4,5,6,7,8,9
1,r2,10,11,12,13,14,15,16,17,18,19
2,r3,20,21,22,23,24,25,26,27,28,29
3,r4,30,31,32,33,34,35,36,37,38,39
4,r5,40,41,42,43,44,45,46,47,48,49
5,r6,50,51,52,53,54,55,56,57,58,59
6,r7,60,61,62,63,64,65,66,67,68,69
7,r8,70,71,72,73,74,75,76,77,78,79
8,r9,80,81,82,83,84,85,86,87,88,89
9,r10,90,91,92,93,94,95,96,97,98,99


In [68]:
df.set_index('index', inplace = True)
df

index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [71]:
df

index,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99
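
The round trip above can also be done without `inplace`, by working with the returned DataFrames. This is a minimal sketch on a made-up frame `demo`; `drop=True` is an existing option of `reset_index()` that throws the old index away instead of keeping it as a column.

```python
import pandas as pd

demo = pd.DataFrame({'c1': [0, 10]}, index=['r1', 'r2'])

moved = demo.reset_index()              # old index becomes a column named 'index'
dropped = demo.reset_index(drop=True)   # old index is discarded instead
restored = moved.set_index('index')     # the column becomes the index again

print(moved.columns.tolist())           # ['index', 'c1']
print(demo.index.tolist())              # ['r1', 'r2']: demo is unchanged without inplace
```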


**Consider that we have a column in our data that could be a useful index,<br>
and we want to set that column as the index!**<br>

In [72]:
array_2d

array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19],
       [20, 21, 22, 23, 24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47, 48, 49],
       [50, 51, 52, 53, 54, 55, 56, 57, 58, 59],
       [60, 61, 62, 63, 64, 65, 66, 67, 68, 69],
       [70, 71, 72, 73, 74, 75, 76, 77, 78, 79],
       [80, 81, 82, 83, 84, 85, 86, 87, 88, 89],
       [90, 91, 92, 93, 94, 95, 96, 97, 98, 99]])

In [73]:
columns

['c1', 'c2', 'c3', 'c4', 'c5', 'c6', 'c7', 'c8', 'c9', 'c10']

In [74]:
df = pd.DataFrame(data = array_2d, index = index, columns = columns)
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
r1,0,1,2,3,4,5,6,7,8,9
r2,10,11,12,13,14,15,16,17,18,19
r3,20,21,22,23,24,25,26,27,28,29
r4,30,31,32,33,34,35,36,37,38,39
r5,40,41,42,43,44,45,46,47,48,49
r6,50,51,52,53,54,55,56,57,58,59
r7,60,61,62,63,64,65,66,67,68,69
r8,70,71,72,73,74,75,76,77,78,79
r9,80,81,82,83,84,85,86,87,88,89
r10,90,91,92,93,94,95,96,97,98,99


In [75]:
abc = 'a b c d e f g h i j'.split() # split at white spaces
# let's put abc into the df as a column called newind
df['newind']=abc
df

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10,newind
r1,0,1,2,3,4,5,6,7,8,9,a
r2,10,11,12,13,14,15,16,17,18,19,b
r3,20,21,22,23,24,25,26,27,28,29,c
r4,30,31,32,33,34,35,36,37,38,39,d
r5,40,41,42,43,44,45,46,47,48,49,e
r6,50,51,52,53,54,55,56,57,58,59,f
r7,60,61,62,63,64,65,66,67,68,69,g
r8,70,71,72,73,74,75,76,77,78,79,h
r9,80,81,82,83,84,85,86,87,88,89,i
r10,90,91,92,93,94,95,96,97,98,99,j


In [78]:
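# setting newind as the index; inplace = True makes the change permanent
df.set_index('newind', inplace = True)

&#9758; Running this cell a second time raises `KeyError: "None of ['newind'] are in the columns"`, because newind is now the index and no longer a column.<br>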

In [80]:
df

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49
f,50,51,52,53,54,55,56,57,58,59
g,60,61,62,63,64,65,66,67,68,69
h,70,71,72,73,74,75,76,77,78,79
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99


### `head()`, `tail()`

In [83]:
# Returns first n rows
df.head() # n = 5 by default 

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49


In [84]:
# Returns last n rows
df.tail(n=2) # n = 5 by default

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99


### `info()`
Provides a concise summary of the DataFrame.

In [85]:
df

newind,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
a,0,1,2,3,4,5,6,7,8,9
b,10,11,12,13,14,15,16,17,18,19
c,20,21,22,23,24,25,26,27,28,29
d,30,31,32,33,34,35,36,37,38,39
e,40,41,42,43,44,45,46,47,48,49
f,50,51,52,53,54,55,56,57,58,59
g,60,61,62,63,64,65,66,67,68,69
h,70,71,72,73,74,75,76,77,78,79
i,80,81,82,83,84,85,86,87,88,89
j,90,91,92,93,94,95,96,97,98,99


In [86]:

df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 10 entries, a to j
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   c1      10 non-null     int32
 1   c2      10 non-null     int32
 2   c3      10 non-null     int32
 3   c4      10 non-null     int32
 4   c5      10 non-null     int32
 5   c6      10 non-null     int32
 6   c7      10 non-null     int32
 7   c8      10 non-null     int32
 8   c9      10 non-null     int32
 9   c10     10 non-null     int32
dtypes: int32(10)
memory usage: 480.0+ bytes


### `describe()`
Generates descriptive statistics that summarize the central tendency, dispersion and shape of a dataset's distribution, excluding `NaN` values.

In [87]:
df.describe()

,c1,c2,c3,c4,c5,c6,c7,c8,c9,c10
count,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0,10.0
mean,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
std,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504,30.276504
min,0.0,1.0,2.0,3.0,4.0,5.0,6.0,7.0,8.0,9.0
25%,22.5,23.5,24.5,25.5,26.5,27.5,28.5,29.5,30.5,31.5
50%,45.0,46.0,47.0,48.0,49.0,50.0,51.0,52.0,53.0,54.0
75%,67.5,68.5,69.5,70.5,71.5,72.5,73.5,74.5,75.5,76.5
max,90.0,91.0,92.0,93.0,94.0,95.0,96.0,97.0,98.0,99.0
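
To see where these numbers come from, here is a sketch that recomputes a few of them for a single Series holding the same values as column c1. The name `summary` is only for illustration.

```python
import numpy as np
import pandas as pd

c1 = pd.Series(np.arange(0, 100, 10))   # 0, 10, ..., 90, like column c1

summary = c1.describe()
print(summary)

# The same numbers computed one at a time
print(c1.mean())           # 45.0
print(c1.std())            # sample standard deviation (ddof=1), about 30.28
print(c1.quantile(0.25))   # 22.5, interpolated between 20 and 30
```

Note that `std` is the sample standard deviation (dividing by n - 1), which is why it is 30.276504 and not the slightly smaller population value.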


# Excellent! 
Congratulations, you are making great progress. Keep it up!