# Data Structures in Pandas

### Demo 2: Functionality of Pandas DataFrame

In this demo, you will learn how to use pandas DataFrames to represent and manipulate data in Python.

##### Question 1:

Perform the following pandas operations:

    1) Create a DataFrame from the raw data below:
    
    'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
    'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'], 
    'age': [42, 52, 36, 24, 73], 
    'preTestScore': [4, 24, 31, ".", "."],
    'postTestScore': ["25,000", "94,000", 57, 62, 70]

    2) Save the DataFrame to a CSV file named example.csv
    3) Read example.csv and print the DataFrame
    4) Read example.csv without column headings
    5) Read example.csv and use 'First Name' and 'Last Name' as the index columns
    6) Read only the first 3 rows of example.csv and print the DataFrame
    7) The 'postTestScore' column uses "," as a thousands separator in some values. Load example.csv so that these values are parsed as numbers


In [None]:
# Import the required libraries

import pandas as pd
import numpy as np

# 1.1 Create a DataFrame from the raw data

raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
        'last_name': ['Miller', 'Jacobson', ".", 'Milner', 'Cooze'],
        'age': [42, 52, 36, 24, 73],
        'preTestScore': [4, 24, 31, ".", "."],
        'postTestScore': ["25,000", "94,000", 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
print(df)


  first_name last_name  age preTestScore postTestScore
0      Jason    Miller   42            4        25,000
1      Molly  Jacobson   52           24        94,000
2       Tina         .   36           31            57
3       Jake    Milner   24            .            62
4        Amy     Cooze   73            .            70


In [None]:
# 1.2 Save the DataFrame to a CSV file (the row index is written as an unnamed first column)

df.to_csv('example.csv')

In [None]:
# 1.3 Read data from CSV and print the dataFrame

df = pd.read_csv('example.csv', header=0)
print(df)

   Unnamed: 0 first_name last_name  age preTestScore postTestScore
0           0      Jason    Miller   42            4        25,000
1           1      Molly  Jacobson   52           24        94,000
2           2       Tina         .   36           31            57
3           3       Jake    Milner   24            .            62
4           4        Amy     Cooze   73            .            70
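
The `Unnamed: 0` column in this output appears because `to_csv` writes the row index as an unlabeled first column by default. A minimal sketch of how that column could be avoided, assuming a small stand-in frame and in-memory buffers rather than `example.csv` (the names `sample`, `with_index`, and `without_index` are illustrative only):

```python
import io
import pandas as pd

# Small stand-in frame (illustrative data, not the full example above)
sample = pd.DataFrame({'first_name': ['Jason', 'Molly'], 'age': [42, 52]})

# Default behaviour: the index is written as an unlabeled first column
with_index = io.StringIO()
sample.to_csv(with_index)
with_index.seek(0)
cols_default = pd.read_csv(with_index).columns.tolist()
print(cols_default)

# index=False omits the index, so no 'Unnamed: 0' column appears on reload
without_index = io.StringIO()
sample.to_csv(without_index, index=False)
without_index.seek(0)
reloaded = pd.read_csv(without_index)
print(reloaded.columns.tolist())
```

The rest of this demo keeps the default behaviour, so its outputs continue to show the extra column.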


In [None]:
# 1.4 Read data from CSV without headers

df = pd.read_csv('example.csv', header=None)
print(df)

     0           1          2    3             4              5
0  NaN  first_name  last_name  age  preTestScore  postTestScore
1  0.0       Jason     Miller   42             4         25,000
2  1.0       Molly   Jacobson   52            24         94,000
3  2.0        Tina          .   36            31             57
4  3.0        Jake     Milner   24             .             62
5  4.0         Amy      Cooze   73             .             70
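
With `header=None`, the header line of the file is read as the first data row, which is why the column names show up in row 0 above. If integer column labels are wanted without the header text landing in the data, one option is to skip the first line as well. A short sketch, assuming a tiny in-memory CSV in place of `example.csv`:

```python
import io
import pandas as pd

# Illustrative CSV text, standing in for a file on disk
csv_text = "first_name,age\nJason,42\nMolly,52\n"

# header=None alone keeps the header line as row 0, so every column is read as text
raw = pd.read_csv(io.StringIO(csv_text), header=None)

# skiprows=1 drops the header line, leaving only data under integer labels
data_only = pd.read_csv(io.StringIO(csv_text), header=None, skiprows=1)

print(raw.shape, data_only.shape)
print(data_only.dtypes)
```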


In [None]:
# 1.5 Read the data again, replace the header with new column names, and use 'First Name' and 'Last Name' as the index columns

df = pd.read_csv('example.csv', header=0, index_col=['First Name', 'Last Name'], 
                 names=['UID', 'First Name', 'Last Name', 'Age', 'Pre-Test Score', 'Post-Test Score'])
print(df)

                      UID  Age Pre-Test Score Post-Test Score
First Name Last Name                                         
Jason      Miller       0   42              4          25,000
Molly      Jacobson     1   52             24          94,000
Tina       .            2   36             31              57
Jake       Milner       3   24              .              62
Amy        Cooze        4   73              .              70
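
Using two columns as the index produces a `MultiIndex`, so a row is addressed by a tuple of both values. A minimal sketch of looking up one row and of flattening the index again, assuming a two-row stand-in frame built with `set_index` (the name `people` is illustrative):

```python
import pandas as pd

# Illustrative two-row frame with the same index columns as above
people = pd.DataFrame({
    'First Name': ['Jason', 'Molly'],
    'Last Name': ['Miller', 'Jacobson'],
    'Age': [42, 52],
}).set_index(['First Name', 'Last Name'])

# A tuple addresses one row across both index levels
jason_age = people.loc[('Jason', 'Miller'), 'Age']

# reset_index turns the index levels back into ordinary columns
flat = people.reset_index()
print(jason_age, flat.columns.tolist())
```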


In [None]:
# 1.6 Read only the first 3 rows of the CSV and print the DataFrame

df = pd.read_csv('example.csv', nrows=3)
print(df)

   Unnamed: 0 first_name last_name  age  preTestScore postTestScore
0           0      Jason    Miller   42             4        25,000
1           1      Molly  Jacobson   52            24        94,000
2           2       Tina         .   36            31            57
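
Because `nrows` stops parsing after the requested rows, later values never influence type inference. That is a plausible reason `preTestScore` looks numeric in this output: the rows containing "." are not read. A small sketch of the difference from reading everything and calling `head(3)`, assuming an in-memory CSV with illustrative values:

```python
import io
import pandas as pd

# Illustrative CSV text; the last row holds a "." placeholder
csv_text = "age,preTestScore\n42,4\n52,24\n36,31\n24,.\n"

# Only the first three data rows are parsed, so the "." is never seen
first_three = pd.read_csv(io.StringIO(csv_text), nrows=3)

# Reading everything and then calling head(3) returns the same rows,
# but the column type was inferred from all values, including the "."
full_head = pd.read_csv(io.StringIO(csv_text)).head(3)

print(first_three['preTestScore'].tolist())
print(full_head['preTestScore'].tolist())
```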


In [None]:
# 1.7 Treat ',' as a thousands separator so that 'postTestScore' is parsed as numbers
df = pd.read_csv('example.csv', thousands=',')
print(df)

   Unnamed: 0 first_name last_name  age preTestScore  postTestScore
0           0      Jason    Miller   42            4          25000
1           1      Molly  Jacobson   52           24          94000
2           2       Tina         .   36           31             57
3           3       Jake    Milner   24            .             62
4           4        Amy     Cooze   73            .             70
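
The `preTestScore` column still contains "." placeholders after this step, so it is not read as a numeric column. One way this could be handled, as a sketch rather than part of the original exercise, is to combine `thousands` with the `na_values` parameter of `read_csv`. The example assumes a two-row in-memory CSV:

```python
import io
import pandas as pd

# Illustrative CSV text with a quoted thousands value and a "." placeholder
csv_text = 'name,preTestScore,postTestScore\nJason,4,"25,000"\nJake,.,62\n'

# thousands=',' parses "25,000" as 25000; na_values='.' turns "." into NaN
scores = pd.read_csv(io.StringIO(csv_text), thousands=',', na_values='.')

print(scores)
print(scores.dtypes)
```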


##### Question 2:

    1) Read the diabetes data from a CSV file
    2) Print the first 10 rows and the last 5 rows of the DataFrame
    3) Display a summary of the data in the DataFrame
    4) Display a summary of only the Glucose column
    5) Take a sample of the data using the first 15 rows
    6) Add a new column "New_Age" to the DataFrame by adding 1 to the values in the existing "Age" column
    7) Drop the column "SkinThickness", as it is not relevant to our analysis
    8) Filter rows where the patient's age is greater than or equal to 50, since such patients have a higher likelihood of diabetes. Use the column New_Age. Then add a second filter condition, Outcome = 1
    9) Sort the values in the DataFrame df4 by "Glucose" in descending order

In [1]:
# Read diabetes data from a CSV file
import pandas as pd
df = pd.read_csv('diabetes.csv')


# Print the first 10 rows of the dataFrame
print(df.head(10))

   Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0            6      148             72             35    250.0  33.6   
1            1       85             66             29    300.0  26.6   
2            8      183             64              0      NaN  23.3   
3            1       89             66             23     94.0  28.1   
4            0      137             40             35    168.0  43.1   
5            5      116             74              0      NaN  25.6   
6            3       78             50             32     88.0  31.0   
7           10      115              0              0    400.0  35.3   
8            2      197             70             45    543.0  30.5   
9            8      125             96              0      0.0   0.0   

   DiabetesPedigreeFunction  Age  Outcome  
0                     0.627   50        1  
1                     0.351   31        0  
2                     0.672   32        1  
3                     0.167   2

In [2]:
# Print the last 5 rows of the dataFrame
print(df.tail())

     Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
763           10      101             76             48    180.0  32.9   
764            2      122             70             27      0.0  36.8   
765            5      121             72             23    112.0  26.2   
766            1      126             60              0      0.0  30.1   
767            1       93             70             31      0.0  30.4   

     DiabetesPedigreeFunction  Age  Outcome  
763                     0.171   63        0  
764                     0.340   27        0  
765                     0.245   30        0  
766                     0.349   47        1  
767                     0.315   23        0  


In [3]:
# Display a summary of the data in dataFrame
print(df.describe())

       Pregnancies     Glucose  BloodPressure  SkinThickness     Insulin  \
count   768.000000  768.000000     768.000000     768.000000  766.000000   
mean      3.845052  120.894531      69.105469      20.536458   81.248042   
std       3.369578   31.972618      19.355807      15.952218  116.221576   
min       0.000000    0.000000       0.000000       0.000000    0.000000   
25%       1.000000   99.000000      62.000000       0.000000    0.000000   
50%       3.000000  117.000000      72.000000      23.000000   36.500000   
75%       6.000000  140.250000      80.000000      32.000000  130.000000   
max      17.000000  199.000000     122.000000      99.000000  846.000000   

              BMI  DiabetesPedigreeFunction         Age     Outcome  
count  768.000000                768.000000  768.000000  768.000000  
mean    31.992578                  0.471876   33.240885    0.348958  
std      7.884160                  0.331329   11.760232    0.476951  
min      0.000000                  

In [4]:
# Display summary of only Glucose column
print(df.Glucose.describe())

count    768.000000
mean     120.894531
std       31.972618
min        0.000000
25%       99.000000
50%      117.000000
75%      140.250000
max      199.000000
Name: Glucose, dtype: float64
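
The `count` row reported by `describe()` counts only non-missing values, which is why Insulin shows 766 in the full summary while the other columns show 768. A minimal sketch of cross-checking that with `isna()`, using made-up values rather than the contents of `diabetes.csv`:

```python
import numpy as np
import pandas as pd

# Illustrative values only, not taken from diabetes.csv
demo = pd.DataFrame({'Glucose': [148, 85, 183], 'Insulin': [250.0, np.nan, 94.0]})

# describe() skips NaN when counting, so Insulin reports one value fewer
summary = demo.describe()

# isna().sum() gives the number of missing values per column
missing = demo.isna().sum()

print(summary.loc['count'])
print(missing)
```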


In [5]:
# Take a sample of the data using first 15 rows
df2 = df.iloc[:15]
print(df2)

    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0             6      148             72             35    250.0  33.6   
1             1       85             66             29    300.0  26.6   
2             8      183             64              0      NaN  23.3   
3             1       89             66             23     94.0  28.1   
4             0      137             40             35    168.0  43.1   
5             5      116             74              0      NaN  25.6   
6             3       78             50             32     88.0  31.0   
7            10      115              0              0    400.0  35.3   
8             2      197             70             45    543.0  30.5   
9             8      125             96              0      0.0   0.0   
10            4      110             92              0      0.0  37.6   
11           10      168             74              0      0.0  38.0   
12           10      139             80            

In [6]:
# Add a new column "New_Age" to the DataFrame by adding 1 to the values in the existing "Age" column
# Copy the slice first so that the assignment does not target a view of df
df2 = df2.copy()
df2['New_Age'] = df2['Age'] + 1
print(df2)

    Pregnancies  Glucose  BloodPressure  SkinThickness  Insulin   BMI  \
0             6      148             72             35    250.0  33.6   
1             1       85             66             29    300.0  26.6   
2             8      183             64              0      NaN  23.3   
3             1       89             66             23     94.0  28.1   
4             0      137             40             35    168.0  43.1   
5             5      116             74              0      NaN  25.6   
6             3       78             50             32     88.0  31.0   
7            10      115              0              0    400.0  35.3   
8             2      197             70             45    543.0  30.5   
9             8      125             96              0      0.0   0.0   
10            4      110             92              0      0.0  37.6   
11           10      168             74              0      0.0  38.0   
12           10      139             80            
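
Assigning a new column to a DataFrame that was produced by slicing another one can trigger pandas' `SettingWithCopyWarning` in versions that do not use copy-on-write by default. Two common ways to sidestep it are sketched below, assuming a small stand-in frame; the names `full`, `subset`, and `subset_alt` are illustrative only:

```python
import pandas as pd

# Illustrative stand-in for the diabetes data
full = pd.DataFrame({'Age': [50, 31, 32, 21]})

# .copy() makes the subset independent of the original frame
subset = full.iloc[:3].copy()
subset['New_Age'] = subset['Age'] + 1

# assign returns a new DataFrame and leaves its input unchanged
subset_alt = full.iloc[:3].assign(New_Age=lambda d: d['Age'] + 1)

print(subset.equals(subset_alt))
print('New_Age' in full.columns)
```

Either form leaves the original frame untouched, which keeps later steps that read from it predictable.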
  


In [7]:
# Drop the column "SkinThickness" as it is not relevant to our analysis

df3 = df2.drop(['SkinThickness'], axis=1)
print(df3.head())

   Pregnancies  Glucose  BloodPressure  Insulin   BMI  \
0            6      148             72    250.0  33.6   
1            1       85             66    300.0  26.6   
2            8      183             64      NaN  23.3   
3            1       89             66     94.0  28.1   
4            0      137             40    168.0  43.1   

   DiabetesPedigreeFunction  Age  Outcome  New_Age  
0                     0.627   50        1       51  
1                     0.351   31        0       32  
2                     0.672   32        1       33  
3                     0.167   21        0       22  
4                     2.288   33        1       34  
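
`drop` also accepts a `columns=` keyword, which some readers find clearer than `axis=1`. A brief sketch of the equivalent call, assuming a small made-up frame; `errors='ignore'` is shown as an optional extra for labels that may already be gone:

```python
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({'Glucose': [148, 85], 'SkinThickness': [35, 29], 'Age': [50, 31]})

# columns= names the axis explicitly and is equivalent to axis=1
dropped = demo.drop(columns=['SkinThickness'])

# errors='ignore' skips labels that are not present instead of raising KeyError
dropped_again = dropped.drop(columns=['SkinThickness'], errors='ignore')

print(dropped.columns.tolist())
```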


In [8]:
# Filter rows where the patient's age is greater than or equal to 50, since such patients have a higher likelihood of diabetes.
# Use the column New_Age

df3[df3['New_Age'] >= 50]

    Pregnancies  Glucose  BloodPressure  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  New_Age
0             6      148             72    250.0  33.6                     0.627   50        1       51
8             2      197             70    543.0  30.5                     0.158   53        1       54
9             8      125             96      0.0   0.0                     0.232   54        1       55
12           10      139             80      0.0  27.1                     1.441   57        0       58
13            1      189             60    846.0  30.1                     0.398   59        1       60
14            5      166             72    175.0  25.8                     0.587   51        1       52


In [9]:
# Add a second filter condition: Outcome equal to 1

df4 = df3[(df3['New_Age'] >= 50) & (df3['Outcome'] == 1)]
df4

    Pregnancies  Glucose  BloodPressure  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  New_Age
0             6      148             72    250.0  33.6                     0.627   50        1       51
8             2      197             70    543.0  30.5                     0.158   53        1       54
9             8      125             96      0.0   0.0                     0.232   54        1       55
13            1      189             60    846.0  30.1                     0.398   59        1       60
14            5      166             72    175.0  25.8                     0.587   51        1       52
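
When combining conditions with `&`, each comparison needs its own parentheses because `&` binds more tightly than `>=` and `==`. The same filter can also be written with `DataFrame.query`. A minimal sketch comparing the two forms, assuming a small made-up frame:

```python
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({'New_Age': [51, 32, 58, 60], 'Outcome': [1, 0, 0, 1]})

# Boolean-mask form: parentheses around each condition are required
mask_result = demo[(demo['New_Age'] >= 50) & (demo['Outcome'] == 1)]

# query form: the same filter written as a string expression
query_result = demo.query('New_Age >= 50 and Outcome == 1')

print(query_result)
```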


In [10]:
# Sort values in the dataFrame df4 by "Glucose" in descending order

df4.sort_values(by=['Glucose'], ascending=False)

    Pregnancies  Glucose  BloodPressure  Insulin   BMI  DiabetesPedigreeFunction  Age  Outcome  New_Age
8             2      197             70    543.0  30.5                     0.158   53        1       54
13            1      189             60    846.0  30.1                     0.398   59        1       60
14            5      166             72    175.0  25.8                     0.587   51        1       52
0             6      148             72    250.0  33.6                     0.627   50        1       51
9             8      125             96      0.0   0.0                     0.232   54        1       55
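
`sort_values` returns a new DataFrame and does not reorder the one it is called on, so the result has to be assigned if it is needed later. The original row labels are also kept, as the output above shows. A short sketch of both points, assuming a three-row made-up frame:

```python
import pandas as pd

# Illustrative values only, with non-consecutive row labels
demo = pd.DataFrame({'Glucose': [148, 197, 125], 'Age': [50, 53, 54]}, index=[0, 8, 9])

# sort_values returns a new frame; demo itself keeps its original order
by_glucose = demo.sort_values(by='Glucose', ascending=False)

# reset_index(drop=True) renumbers the rows after sorting
renumbered = by_glucose.reset_index(drop=True)

print(by_glucose.index.tolist())
print(demo.index.tolist())
```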


##### Conclusion: This demo shows how to create pandas DataFrames, read and write them as CSV files, and summarize, filter, and sort their data.