## Manipulating data with pandas

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns

**Note:** Seaborn is a data visualisation library you will learn more about later. Here we are using it to access an example dataset.

## Iris dataset examples

In [3]:
# Load data from seaborn package
iris_df = sns.load_dataset('iris')
# Print the column names
print(iris_df.columns)

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')


### Selecting columns in pandas

In [4]:
# Select only the species column
just_the_species = iris_df['species']
just_the_species.sample(5)

51     versicolor
38         setosa
64     versicolor
142     virginica
4          setosa
Name: species, dtype: object

In [7]:
# Select columns with sepal and petal information
sepal_and_petal_info = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
sepal_and_petal_info.sample(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
13,4.3,3.0,1.1,0.1
146,6.3,2.5,5.0,1.9
14,5.8,4.0,1.2,0.2
67,5.8,2.7,4.1,1.0
47,4.6,3.2,1.4,0.2


In [8]:
# Filter for specific values in a column
small_sepal_length = iris_df[iris_df['sepal_length'] < 4.8]
small_sepal_length.sample(5)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
2,4.7,3.2,1.3,0.2,setosa
47,4.6,3.2,1.4,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
22,4.6,3.6,1.0,0.2,setosa
41,4.5,2.3,1.3,0.3,setosa
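
The filter above works in two steps: the comparison builds a boolean Series, and indexing with that Series keeps only the rows marked `True`. The sketch below makes the intermediate step visible. It uses a small hand-made frame (`mini_iris`, a hypothetical stand-in for the real dataset) so it runs without downloading anything.

```python
import pandas as pd

# Hypothetical miniature of the iris data, built in memory for illustration
mini_iris = pd.DataFrame({
    "sepal_length": [5.1, 4.7, 6.3, 4.6, 5.8],
    "species": ["setosa", "setosa", "virginica", "setosa", "versicolor"],
})

# Step 1: the comparison yields one True/False value per row
mask = mini_iris["sepal_length"] < 4.8
print(mask.tolist())  # [False, True, False, True, False]

# Step 2: indexing with the mask keeps only the rows marked True
small = mini_iris[mask]
print(small)
```

Note that the filtered frame keeps the original row labels (1 and 3 here), which is why the sampled output above shows non-consecutive index values.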


## Insurance dataset examples

Below we import data from a CSV file called 'insurance.csv'. The file is found in the task folder. Make sure it is saved in the same directory as this notebook.

In [9]:
# Load data
insurance_df = pd.read_csv("insurance.csv")
insurance_df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

### Filtering and grouping in pandas

In [10]:
# Get people aged strictly between 30 and 35 (ages 31 to 34)
between_30_and_35 = insurance_df[(insurance_df['age'] > 30) & (insurance_df['age'] < 35)]

# Print mean charges for everyone in that age range
print(between_30_and_35['charges'].mean())

10839.408303333334


In [11]:
# Use the query method to get the same people (ages 31 to 34)
between_30_and_35 = insurance_df.query("age > 30 and age < 35")

# Print mean charges for everyone in that age range
print(between_30_and_35['charges'].mean())

10839.408303333334
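
The two cells above give the same result because a boolean mask and `query()` express the same condition. Both use strict comparisons, so ages 30 and 35 themselves are excluded. The sketch below checks this on a small hypothetical frame (`ages`, with made-up charges) and shows `Series.between()`, which includes both endpoints by default, as one way to get an inclusive range.

```python
import pandas as pd

# Hypothetical data for illustration; the charges are made up
ages = pd.DataFrame({
    "age": [29, 30, 31, 34, 35, 36],
    "charges": [100.0, 200.0, 300.0, 400.0, 700.0, 900.0],
})

# Strict comparison written as a boolean mask and as a query string
strict_mask = ages[(ages["age"] > 30) & (ages["age"] < 35)]
strict_query = ages.query("age > 30 and age < 35")

# between() includes both endpoints by default
inclusive = ages[ages["age"].between(30, 35)]

print(strict_mask["age"].tolist())    # [31, 34]
print(inclusive["age"].tolist())      # [30, 31, 34, 35]
print(strict_mask["charges"].mean())  # 350.0
print(inclusive["charges"].mean())    # 400.0
```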


In [None]:
# Get the mean charges for each age
print(insurance_df.groupby('age')['charges'].mean())

age
18     7086.217556
19     9747.909335
20    10159.697736
21     4730.464330
22    10012.932802
23    12419.820040
24    10648.015962
25     9838.365311
26     6133.825309
27    12184.701721
28     9069.187564
29    10430.158727
30    12719.110358
31    10196.980573
32     9220.300291
33    12351.532987
34    11613.528121
35    11307.182031
36    12204.476138
37    18019.911877
38     8102.733674
39    11778.242945
40    11772.251310
41     9653.745650
42    13061.038669
43    19267.278653
44    15859.396587
45    14830.199856
46    14342.590639
47    17653.999593
48    14632.500445
49    12696.006264
50    15663.003301
51    15682.255867
52    18256.269719
53    16020.930755
54    18758.546475
55    16164.545488
56    15025.515837
57    16447.185250
58    13878.928112
59    18895.869532
60    21979.418507
61    22024.457609
62    19163.856573
63    19884.998461
64    23275.530837
Name: charges, dtype: float64
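
`groupby` is not limited to a single statistic. As a minimal sketch, the example below groups a small hypothetical frame (`policies`) by `smoker` and computes both the mean and the number of rows per group with `agg()`. The `'yes'`/`'no'` values and the charges are assumptions made for illustration, not values read from 'insurance.csv'.

```python
import pandas as pd

# Hypothetical data for illustration; values are made up
policies = pd.DataFrame({
    "smoker": ["yes", "no", "no", "yes", "no"],
    "charges": [30000.0, 4000.0, 6000.0, 20000.0, 5000.0],
})

# One row per group, one column per requested statistic
summary = policies.groupby("smoker")["charges"].agg(["mean", "count"])
print(summary)
```

The result is a DataFrame indexed by the group key, so a single value can be read with `summary.loc["no", "mean"]`.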


## Balance dataset examples

Below we import data from a text file called 'balance.txt'. The file is found in the task folder. Make sure it is saved in the same directory as this notebook.

In [13]:
# Load data
df = pd.read_csv('balance.txt',sep=' ')
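
The `sep=' '` argument tells `read_csv` that fields are separated by single spaces rather than commas. The sketch below shows the effect on a short, made-up piece of text read through `io.StringIO`, so it runs without any file on disk. The three columns are a hypothetical excerpt, not the real layout of 'balance.txt'.

```python
import io

import pandas as pd

# Hypothetical space-separated text standing in for a file on disk
raw_text = (
    "Balance Income Gender\n"
    "12.24 14.891 Male\n"
    "23.28 106.025 Female\n"
)

# sep=" " splits each line on single spaces
sample = pd.read_csv(io.StringIO(raw_text), sep=" ")
print(sample)
```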

Here is how to view the top rows of the frame. The `head()` method shows the first five observations by default. Use it to get a glimpse of the data, such as the column names and the type of data in each column.

In [14]:
df.head()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
0,12.240798,14.891,3606,283,2,34,11,Male,No,Yes,Caucasian
1,23.283334,106.025,6645,483,3,82,15,Female,Yes,Yes,Asian
2,22.530409,104.593,7075,514,4,71,11,Male,No,No,Asian
3,27.652811,148.924,9504,681,3,36,11,Female,No,No,Asian
4,16.893978,55.882,4897,357,2,68,16,Male,No,Yes,Caucasian


`tail()` shows the last observations of the dataset. Passing a number, as in `tail(7)`, sets how many rows are returned.

In [15]:
df.tail(7)

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
393,10.958612,17.316,1335,138,2,65,13,Male,No,No,African American
394,14.735482,49.794,5758,410,4,40,8,Male,No,No,Caucasian
395,8.764984,12.096,4100,307,3,32,13,Male,No,Yes,Caucasian
396,9.943838,13.364,3838,296,5,65,17,Male,No,No,African American
397,14.882078,57.872,4171,321,5,67,12,Female,No,Yes,Caucasian
398,12.001071,37.728,2525,192,1,44,13,Male,No,Yes,Caucasian
399,10.159598,18.701,5524,415,5,64,7,Female,No,No,Asian


To get the range of row labels in your DataFrame, use the `DataFrame.index` attribute. This allows you to identify the valid index range for your observations. As shown below, the index here is a `RangeIndex` with `start=0` and `stop=400`. The stop value is exclusive, so the valid labels run from 0 to 399. Consequently, attempting to access a label outside this range, such as 450, would be invalid for this dataset.

In [16]:
df.index

RangeIndex(start=0, stop=400, step=1)
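
The `stop` value of a `RangeIndex` is exclusive, just like the end of Python's `range()`. The sketch below illustrates this on a three-row hypothetical frame: the last valid label is one less than `stop`, and asking `.loc` for a label outside the range raises a `KeyError`.

```python
import pandas as pd

# Hypothetical three-row frame for illustration
frame = pd.DataFrame({"value": [10, 20, 30]})
print(frame.index)  # RangeIndex(start=0, stop=3, step=1)

# The last valid label is stop - 1
last_label = frame.index[-1]
print(last_label)  # 2

# A label outside the range is rejected
try:
    frame.loc[3]
    found = True
except KeyError:
    found = False
print(found)  # False
```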

The `DataFrame.columns` attribute returns the names of all columns in the DataFrame (here `df`) as an `Index` object.

In [17]:
df.columns

Index(['Balance', 'Income', 'Limit', 'Rating', 'Cards', 'Age', 'Education',
       'Gender', 'Student', 'Married', 'Ethnicity'],
      dtype='object')

`describe()` shows a quick statistical summary of your data. As you can see, by default statistics are only calculated for columns with numerical values.

In [18]:
df.describe()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education
count,400.0,400.0,400.0,400.0,400.0,400.0,400.0
mean,13.429175,45.218885,4735.6,354.94,2.9575,55.6675,13.45
std,5.669256,35.244273,2308.198848,154.724143,1.371275,17.249807,3.125207
min,3.749403,10.354,855.0,93.0,1.0,23.0,5.0
25%,9.891439,21.00725,3088.0,247.25,2.0,41.75,11.0
50%,11.779615,33.1155,4622.5,344.0,3.0,56.0,14.0
75%,15.236961,57.47075,5872.75,437.25,4.0,70.0,16.0
max,38.785123,186.634,13913.0,982.0,9.0,98.0,20.0
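
If the text columns should be summarised as well, `describe()` accepts an `include` argument. As a minimal sketch on a hypothetical four-row frame, `include="all"` adds rows such as `unique`, `top`, and `freq` for non-numeric columns, with `NaN` wherever a statistic does not apply.

```python
import pandas as pd

# Hypothetical data for illustration
people = pd.DataFrame({
    "Age": [34, 82, 71, 36],
    "Gender": ["Male", "Female", "Male", "Male"],
})

# Default behaviour: only the numeric column is summarised
numeric_summary = people.describe()
print(list(numeric_summary.columns))  # ['Age']

# include="all" summarises every column
full_summary = people.describe(include="all")
print(full_summary)
```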


The `sort_values()` method sorts the rows of a DataFrame by the values in one or more columns, named with the `by` parameter. It sorts in ascending order by default. To sort in descending order, specify `ascending=False` in the method call.

In [19]:
df.sort_values(by='Income',ascending=False).head()

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
28,35.271011,186.634,13414,949,2,41,14,Female,No,Yes,African American
323,33.74558,182.728,13913,982,4,98,17,Male,No,Yes,Caucasian
355,34.034656,180.682,11966,832,2,58,8,Female,No,Yes,African American
261,38.785123,180.379,9310,665,3,67,8,Female,Yes,Yes,Asian
275,30.21208,163.329,8732,636,3,50,14,Male,No,Yes,Caucasian
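
`sort_values()` also accepts a list of columns, with a matching list of directions in `ascending`. The sketch below, on a hypothetical four-row frame, sorts by `Education` from high to low and breaks ties by `Income` from low to high. It also shows that the call returns a new frame and leaves the original order untouched.

```python
import pandas as pd

# Hypothetical data for illustration
cards = pd.DataFrame({
    "Education": [11, 15, 11, 15],
    "Income": [14.9, 106.0, 104.6, 55.9],
})

# Education descending, then Income ascending within each Education value
ordered = cards.sort_values(by=["Education", "Income"], ascending=[False, True])
print(ordered)

print(ordered.index.tolist())  # [3, 1, 0, 2]
print(cards.index.tolist())    # [0, 1, 2, 3], the original is unchanged
```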


Selecting a single column, which yields a Series.



In [20]:
df.Rating.head(5)

0    283
1    483
2    514
3    681
4    357
Name: Rating, dtype: int64
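
Dot notation such as `df.Rating` is shorthand for `df['Rating']`. The sketch below confirms that both return the same Series. It also includes a column called `Credit Limit`, a hypothetical name chosen only because it contains a space, to show a case where the dot form cannot be written and brackets are needed.

```python
import pandas as pd

# Hypothetical data; "Credit Limit" is an invented column name with a space
frame = pd.DataFrame({
    "Rating": [283, 483],
    "Credit Limit": [3606, 6645],
})

# Both forms return the same Series
by_attribute = frame.Rating
by_bracket = frame["Rating"]
print(by_attribute.equals(by_bracket))  # True

# A name containing a space can only be reached with brackets
limits = frame["Credit Limit"]
print(limits.tolist())  # [3606, 6645]
```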

Selecting via `[ ]` with a slice selects rows by position. The end of the slice is excluded.



In [21]:
df[50:60]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
50,10.107356,36.362,5183,376,3,49,15,Male,No,Yes,African American
51,13.010768,39.705,3969,301,2,27,20,Male,No,Yes,African American
52,11.924342,44.205,5441,394,1,32,12,Male,No,Yes,Caucasian
53,9.728192,16.304,5466,413,4,66,10,Male,No,Yes,Asian
54,7.665662,15.333,1499,138,2,47,9,Female,No,Yes,Asian
55,11.454337,32.916,1786,154,2,60,8,Female,No,Yes,Asian
56,17.053691,57.1,4742,372,7,79,18,Female,No,Yes,Asian
57,18.155488,76.273,4779,367,4,65,14,Female,No,Yes,Caucasian
58,9.180797,10.354,3480,281,2,70,17,Male,No,Yes,Caucasian
59,16.424095,51.872,5294,390,4,81,17,Female,No,No,Caucasian


In [22]:
df.loc[40:50]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
40,12.029646,34.95,3327,253,3,54,14,Female,No,No,African American
41,25.291008,113.659,7659,538,2,66,15,Male,Yes,Yes,African American
42,13.123669,44.158,4763,351,2,66,13,Female,No,Yes,Asian
43,12.319976,36.929,6257,445,1,24,14,Female,No,Yes,Asian
44,12.059596,31.861,6375,469,3,25,16,Female,No,Yes,Caucasian
45,18.653661,77.38,7569,564,3,50,12,Female,No,Yes,Caucasian
46,10.805825,19.531,5043,376,2,64,16,Female,Yes,Yes,Asian
47,11.488565,44.646,4431,320,2,49,15,Male,Yes,Yes,Caucasian
48,13.433468,44.522,2252,205,6,72,15,Male,No,Yes,Asian
49,14.007633,43.479,4569,354,4,49,13,Male,Yes,Yes,African American
50,10.107356,36.362,5183,376,3,49,15,Male,No,Yes,African American
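
Plain `[ ]` slicing and `.loc` slicing treat the end of the range differently: `[ ]` slices by position and excludes the end, while `.loc` slices by label and includes it. The sketch below shows the difference on a hypothetical ten-row frame.

```python
import pandas as pd

# Hypothetical ten-row frame with the default labels 0..9
frame = pd.DataFrame({"value": list(range(10))})

# Positional slice: rows at positions 2, 3, 4 (end excluded)
by_position = frame[2:5]
print(by_position.index.tolist())  # [2, 3, 4]

# Label slice: labels 2, 3, 4, 5 (end included)
by_label = frame.loc[2:5]
print(by_label.index.tolist())  # [2, 3, 4, 5]
```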


#### Selecting Specific Rows and Columns

The `.iloc` indexer selects rows and columns by integer position. Using `df.iloc[5:8, [1, 7]]`, for instance, selects the rows at positions 5 to 7 (the end of the slice is excluded) and the columns at positions 1 and 7. Use a colon to specify a range, and a comma to separate the row selection from the column selection.

In [23]:
df.iloc[5:8,[1,7]]


Unnamed: 0,Income,Gender
5,80.18,Male
6,20.996,Female
7,71.408,Male
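
For comparison, the same selection can usually be written with `.loc` by using labels instead of positions. The sketch below does both on a hypothetical four-row frame. Note that the `.loc` row slice ends at label 2 because label slices include their end point.

```python
import pandas as pd

# Hypothetical data for illustration
frame = pd.DataFrame({
    "Balance": [12.2, 23.3, 22.5, 27.7],
    "Income": [14.9, 106.0, 104.6, 148.9],
    "Gender": ["Male", "Female", "Male", "Female"],
})

# By position: rows 1 and 2, columns at positions 1 and 2
subset = frame.iloc[1:3, [1, 2]]

# By label: the same rows and columns named explicitly
same_by_label = frame.loc[1:2, ["Income", "Gender"]]

print(subset)
print(subset.equals(same_by_label))  # True
```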


A comparison on a single column's values can be used to select rows. In the example below, we check whether there are any users above the age of 90.



In [24]:
df[df.Age > 90]

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
209,28.14285,151.947,9156,642,2,91,11,Female,No,Yes,African American
323,33.74558,182.728,13913,982,4,98,17,Male,No,Yes,Caucasian


In [36]:
# Selecting the 'Limit' and 'Rating' columns of the first five observations.
limit_and_rating = df[['Limit', 'Rating']]
print(limit_and_rating.head(5))

   Limit  Rating
0   3606     283
1   6645     483
2   7075     514
3   9504     681
4   4897     357


In [35]:
# Selecting the first five observations with four cards.
cards_4 = df.query("Cards == 4")
print(cards_4.head(5))

      Balance   Income  Limit  Rating  Cards  Age  Education  Gender Student  \
2   22.530409  104.593   7075     514      4   71         11    Male      No   
5   22.486178   80.180   8047     569      4   77         10    Male      No   
10  13.994990   63.095   8117     589      4   30         14    Male      No   
20   9.853100   17.700   2860     235      4   63         16  Female      No   
29  14.007770   26.813   5611     411      4   55         16  Female      No   

   Married  Ethnicity  
2       No      Asian  
5       No  Caucasian  
10     Yes  Caucasian  
20      No      Asian  
29      No  Caucasian  


In [39]:
# Sorting the observations by 'Education', showing users with a high education value first.
df.sort_values(by='Education',ascending=False)

Unnamed: 0,Balance,Income,Limit,Rating,Cards,Age,Education,Gender,Student,Married,Ethnicity
60,13.297283,35.510,5198,364,2,35,20,Female,No,No,Asian
51,13.010768,39.705,3969,301,2,27,20,Male,No,Yes,African American
378,10.140785,19.349,4941,366,1,33,19,Male,No,Yes,Caucasian
247,13.774598,36.364,2220,188,3,50,19,Male,No,No,Caucasian
238,11.079748,26.532,2910,236,6,58,19,Female,No,Yes,Caucasian
...,...,...,...,...,...,...,...,...,...,...,...
284,8.298482,14.711,2047,167,2,67,6,Male,No,Yes,Caucasian
368,19.555838,89.000,5759,440,3,37,6,Female,No,No,Caucasian
224,25.262836,121.709,7818,584,4,50,6,Male,No,Yes,Caucasian
254,13.997826,36.508,6386,469,4,79,6,Female,No,Yes,Caucasian


In [None]:
# 2. Write a short explanation in the form of a comment for the following lines of code:
    
# a. df.iloc[:,:] 
# Selects all rows (:) and all columns (:) of the DataFrame.

# b. df.iloc[5:,5:] 
# Selects all rows from position 5 (the 6th row) onwards and all columns from position 5 (the 6th column) onwards.

# c. df.iloc[:,0] 
# Selects all rows (:) and only the first column (0).

# d. df.iloc[9,:] 
# Selects the 10th row (index starts at 0) and all columns (:).
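
The four expressions above can be checked on a small frame. The sketch below is a minimal illustration using a hypothetical 12-by-8 grid in which each cell holds `row * 10 + column`, so the value itself shows where it came from.

```python
import pandas as pd

# Hypothetical 12 x 8 grid; each value encodes its own position
grid = pd.DataFrame([[r * 10 + c for c in range(8)] for r in range(12)])

everything = grid.iloc[:, :]     # all rows, all columns
lower_right = grid.iloc[5:, 5:]  # from position 5 onwards in both directions
first_column = grid.iloc[:, 0]   # a Series holding the first column
tenth_row = grid.iloc[9, :]      # a Series holding the row at position 9

print(everything.shape)        # (12, 8)
print(lower_right.shape)       # (7, 3)
print(lower_right.iloc[0, 0])  # 55, i.e. row 5, column 5
print(first_column.tolist()[:3])  # [0, 10, 20]
print(tenth_row.tolist()[:3])     # [90, 91, 92]
```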