### Lesson 2

In this lesson, we cover some basics of the pandas package:
1. Data structures: array, Series, DataFrame
2. Reading in a CSV file
3. Data types: int, string, date, etc.
4. Sorting data
5. Group by

In [1]:
# import pandas library
import pandas as pd


In [2]:
# Read in a csv file
df = pd.read_csv("https://raw.githubusercontent.com/vyomshm/predicting-coronary-heart-disease-with-tensorflow-and-tensorboard/master/data/heart.csv")
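`pd.read_csv` also accepts a local path or any file-like object, not only a URL. The sketch below is a minimal illustration that runs without a network connection; the CSV text and the names `csv_text`, `small`, and `subset` are made up for this example and only mimic a few columns of the heart dataset.

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a real file or URL
csv_text = """sbp,tobacco,famhist,age,chd
160,12.0,Present,52,1
144,0.01,Absent,63,1
118,0.08,Present,46,0
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO
small = pd.read_csv(io.StringIO(csv_text))
print(small.shape)

# usecols keeps only the named columns; nrows limits how many rows are read
subset = pd.read_csv(io.StringIO(csv_text), usecols=["age", "chd"], nrows=2)
print(subset)
```

Reading only the columns and rows you need can be handy when a file is large.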

In [3]:
# print out the first 5 rows
df.head()

,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
0,160,12.0,5.73,23.11,Present,49,25.3,97.2,52,1
1,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
2,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
3,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
4,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1


In [4]:
# get the data structure and column information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 462 entries, 0 to 461
Data columns (total 10 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   sbp        462 non-null    int64  
 1   tobacco    462 non-null    float64
 2   ldl        462 non-null    float64
 3   adiposity  462 non-null    float64
 4   famhist    462 non-null    object 
 5   typea      462 non-null    int64  
 6   obesity    462 non-null    float64
 7   alcohol    462 non-null    float64
 8   age        462 non-null    int64  
 9   chd        462 non-null    int64  
dtypes: float64(5), int64(4), object(1)
memory usage: 36.2+ KB
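The `Dtype` column above shows how pandas stores each column. As a rough sketch of working with data types, the example below uses a small made-up DataFrame (the name `toy` and the `visit` date column are assumptions for illustration; the heart dataset has no date column) to inspect types and convert them.

```python
import pandas as pd

# Hypothetical toy data, not part of the heart dataset
toy = pd.DataFrame({
    "age": [52, 63, 46],                          # whole numbers
    "ldl": [5.73, 4.41, 3.48],                    # decimal numbers
    "famhist": ["Present", "Absent", "Present"],  # text
    "visit": ["2020-01-15", "2020-02-01", "2020-03-10"],  # dates stored as text
})

# Text columns show up as object or a string dtype, depending on the pandas version
print(toy.dtypes)

# Convert the text column into real datetime values
toy["visit"] = pd.to_datetime(toy["visit"])

# Convert integers to floats with astype
toy["age_float"] = toy["age"].astype(float)

print(toy.dtypes)
```

Once a column holds real datetime values, it sorts chronologically rather than alphabetically.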


In [5]:
# Series vs Array vs List
type(df['age'])

pandas.core.series.Series

##### Similarities
Like lists, arrays are ordered, mutable, enclosed in square brackets, and able to store non-unique items. A pandas Series, such as `df['age']` above, is a one-dimensional array with an index (row labels) attached.

##### Differences
* Arrays need to be created explicitly. Lists don't, since they are built into Python: a list is created by simply enclosing a sequence of elements in square brackets. Creating an array, on the other hand, requires a specific function from either the standard-library array module (array.array()) or the NumPy package (numpy.array()). Partly because of this, lists are the more common choice in everyday Python code.

* Arrays can store data very compactly, since all elements share one type, and are more efficient for storing large amounts of data.

* NumPy arrays (and pandas Series) are great for numerical operations; lists cannot directly handle element-wise math. For example, you can divide each element of a NumPy array by the same number with just one line of code. If you try the same with a list, you'll get a TypeError. Note that array.array() from the standard library does not support this kind of arithmetic either.
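The last point can be sketched as follows. This is a minimal illustration assuming NumPy is installed; the variable names are made up, and the values are the first five ages from the dataset.

```python
import numpy as np

ages_list = [52, 63, 46, 58, 49]
ages_array = np.array(ages_list)

# Element-wise division works in one line on a NumPy array
halved = ages_array / 2
print(halved)

# The same operation on a plain list raises a TypeError
try:
    ages_list / 2
except TypeError as err:
    print("TypeError:", err)

# With a list, you need a loop or a list comprehension instead
halved_list = [x / 2 for x in ages_list]
print(halved_list)
```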
 
 

In [6]:
# Adding a number to a list is not allowed and raises a TypeError
a = [1, 2, 3, 4, 5]
print(a + 1)

TypeError: can only concatenate list (not "int") to list

In [9]:
# A Series supports element-wise arithmetic
b = df['age'][:5]
print(b)
print(b + 1)

0    52
1    63
2    46
3    58
4    49
Name: age, dtype: int64
0    53
1    64
2    47
3    59
4    50
Name: age, dtype: int64
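To move between the three structures, a Series can be converted to a NumPy array or a plain list. The sketch below builds its own small Series (the names `s`, `arr`, and `lst` are illustrative) so it runs without the dataset.

```python
import pandas as pd

# The first five ages from the dataset, rebuilt by hand
s = pd.Series([52, 63, 46, 58, 49], name="age")

arr = s.to_numpy()  # Series -> NumPy array (drops the index)
lst = s.tolist()    # Series -> plain Python list

print(type(s).__name__, type(arr).__name__, type(lst).__name__)

# Arithmetic works element-wise on the Series and the array
print((s + 1).tolist())
print((arr + 1).tolist())
```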


In [10]:
# Sorting data: start with the first 20 ages
b = df['age'][:20]
print(b)

0     52
1     63
2     46
3     58
4     49
5     45
6     38
7     58
8     29
9     53
10    60
11    40
12    17
13    15
14    53
15    46
16    49
17    53
18    62
19    59
Name: age, dtype: int64


In [12]:
# Sort from largest to smallest; each value keeps its original index label
b.sort_values(ascending=False)

1     63
18    62
10    60
19    59
3     58
7     58
9     53
14    53
17    53
0     52
4     49
16    49
15    46
2     46
5     45
11    40
6     38
8     29
12    17
13    15
Name: age, dtype: int64

In [13]:
# Sort from smallest to largest (ascending=True is the default)
b.sort_values(ascending=True)

13    15
12    17
8     29
6     38
11    40
5     45
2     46
15    46
16    49
4     49
0     52
17    53
14    53
9     53
7     58
3     58
19    59
10    60
18    62
1     63
Name: age, dtype: int64
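Note that `sort_values` returns a new, sorted Series and leaves the original untouched, and that the index labels travel with their values. A minimal sketch with hand-typed ages (the variable names are illustrative):

```python
import pandas as pd

ages = pd.Series([52, 63, 46, 58, 49], name="age")

# sort_values returns a sorted copy; `ages` itself is not modified
sorted_ages = ages.sort_values(ascending=False)
print(ages.tolist())
print(sorted_ages.tolist())

# The index labels still point to the original positions
print(sorted_ages.index.tolist())

# reset_index(drop=True) renumbers the rows 0, 1, 2, ... after sorting
renumbered = sorted_ages.reset_index(drop=True)
print(renumbered.index.tolist())
```

To keep the sorted result, assign it to a variable as done here.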

In [16]:
# Sort by multiple columns
df[:10].sort_values(by=['typea','age'], ascending=True)

,sbp,tobacco,ldl,adiposity,famhist,typea,obesity,alcohol,age,chd
8,114,0.0,3.83,19.4,Present,49,24.86,2.49,29,0
0,160,12.0,5.73,23.11,Present,49,25.3,97.2,52,1
3,170,7.5,6.41,38.03,Present,51,31.99,24.26,58,1
2,118,0.08,3.48,32.28,Present,52,29.14,3.81,46,0
1,144,0.01,4.41,28.61,Absent,55,28.87,2.06,63,1
6,142,4.05,3.38,16.2,Absent,59,20.81,2.62,38,0
4,134,13.6,3.5,27.78,Present,60,25.99,57.34,49,1
5,132,6.2,6.47,36.21,Present,62,30.77,14.14,45,0
7,114,4.08,4.59,14.6,Present,62,23.11,6.72,58,1
9,132,0.0,5.8,30.96,Present,69,30.11,0.0,53,1
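When sorting by several columns, `ascending` can also be a list with one value per column. The sketch below copies the `typea` and `age` values of the ten rows above into a small DataFrame (the name `rows` is made up) and sorts `typea` ascending while breaking ties by `age` descending.

```python
import pandas as pd

# typea and age values copied from the first ten rows of the dataset
rows = pd.DataFrame({
    "typea": [49, 55, 52, 51, 60, 62, 59, 62, 49, 69],
    "age":   [52, 63, 46, 58, 49, 45, 38, 58, 29, 53],
})

# typea from low to high; within equal typea, age from high to low
mixed = rows.sort_values(by=["typea", "age"], ascending=[True, False])
print(mixed)
```

Compared with the output above, rows 0 and 8 (both `typea` 49) and rows 5 and 7 (both `typea` 62) swap places, because ties are now ordered from the oldest to the youngest.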


In [17]:
# Group by famhist and count the non-null values in each column
df.groupby(by='famhist').count()

famhist,sbp,tobacco,ldl,adiposity,typea,obesity,alcohol,age,chd
Absent,270,270,270,270,270,270,270,270,270
Present,192,192,192,192,192,192,192,192,192


In [18]:
# Sum of each numeric column within each group
df.groupby(by='famhist').sum()

famhist,sbp,tobacco,ldl,adiposity,typea,obesity,alcohol,age,chd
Absent,36949,889.07,1203.89,6538.24,14238,6921.13,4153.66,10764,64
Present,26958,790.6,986.14,5199.67,10296,5111.25,3720.85,9017,96


In [19]:
# Mean of each numeric column within each group
df.groupby(by='famhist').mean()

famhist,sbp,tobacco,ldl,adiposity,typea,obesity,alcohol,age,chd
Absent,136.848148,3.292852,4.458852,24.215704,52.733333,25.633815,15.383926,39.866667,0.237037
Present,140.40625,4.117708,5.136146,27.081615,53.625,26.621094,19.379427,46.963542,0.5
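Often only one column and a few statistics are of interest. As a hedged sketch, the example below rebuilds the first five rows of `famhist`, `age`, and `chd` by hand (the name `toy` is made up) and uses `agg` to compute several statistics in one call.

```python
import pandas as pd

# famhist, age and chd values copied from the first five rows of the dataset
toy = pd.DataFrame({
    "famhist": ["Present", "Absent", "Present", "Present", "Present"],
    "age": [52, 63, 46, 58, 49],
    "chd": [1, 1, 0, 1, 1],
})

# Select one column after grouping, then compute several statistics at once
stats = toy.groupby("famhist")["age"].agg(["count", "mean", "max"])
print(stats)

# Because chd is 0 or 1, its mean is the share of rows with chd == 1
chd_rate = toy.groupby("famhist")["chd"].mean()
print(chd_rate)
```

Selecting the column before aggregating keeps the result small and easy to read.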


In [20]:
# How to create a DataFrame from scratch
df_1 = pd.DataFrame({
    'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot'],
    'Max Speed': [380., 370., 24., 26.],
})

df_1

,Animal,Max Speed
0,Falcon,380.0
1,Falcon,370.0
2,Parrot,24.0
3,Parrot,26.0
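The earlier steps work the same way on a hand-made DataFrame. The sketch below recreates `df_1` so it runs on its own, then sorts and groups it; the names `fastest_first` and `mean_speed` are illustrative.

```python
import pandas as pd

df_1 = pd.DataFrame({
    "Animal": ["Falcon", "Falcon", "Parrot", "Parrot"],
    "Max Speed": [380.0, 370.0, 24.0, 26.0],
})

# Sorting: fastest animals first
fastest_first = df_1.sort_values(by="Max Speed", ascending=False)
print(fastest_first)

# Group by: average max speed per animal
mean_speed = df_1.groupby("Animal")["Max Speed"].mean()
print(mean_speed)
```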


Try creating your own DataFrame and repeating the steps above yourself :)