## Manipulating data with pandas

In [2]:
# Import libraries
import pandas as pd
import seaborn as sns

## Iris dataset examples

In [3]:
# Load data from seaborn package
iris_df = sns.load_dataset('iris')
# Print all columns
print(iris_df.columns)

Index(['sepal_length', 'sepal_width', 'petal_length', 'petal_width',
       'species'],
      dtype='object')


### Selecting columns in pandas

In [4]:
# Select only the species column
just_the_species = iris_df['species']
just_the_species.sample(5)

8         setosa
58    versicolor
93    versicolor
70    versicolor
78    versicolor
Name: species, dtype: object

In [5]:
# Select columns with sepal and petal information
sepal_and_petal_info = iris_df[['sepal_length', 'sepal_width', 'petal_length', 'petal_width']]
sepal_and_petal_info.sample(5)

     sepal_length  sepal_width  petal_length  petal_width
24            4.8          3.4           1.9          0.2
45            4.8          3.0           1.4          0.3
70            5.9          3.2           4.8          1.8
123           6.3          2.7           4.9          1.8
118           7.7          2.6           6.9          2.3


In [6]:
# Keep only the rows where sepal_length is below 4.8
small_sepal_length = iris_df[iris_df['sepal_length'] < 4.8]
small_sepal_length.sample(5)

    sepal_length  sepal_width  petal_length  petal_width species
42           4.4          3.2           1.3          0.2  setosa
38           4.4          3.0           1.3          0.2  setosa
22           4.6          3.6           1.0          0.2  setosa
29           4.7          3.2           1.6          0.2  setosa
8            4.4          2.9           1.4          0.2  setosa
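
The filter above works because the comparison produces a boolean Series, which is then used to pick rows. A minimal sketch of that mechanism, using a small hand-made frame (`toy_iris` and its values are made up for illustration, not taken from the real iris data):

```python
import pandas as pd

# Toy stand-in for the iris data; the values are invented for illustration
toy_iris = pd.DataFrame({
    "sepal_length": [4.4, 5.1, 4.7, 6.3, 4.6],
    "species": ["setosa", "setosa", "setosa", "virginica", "versicolor"],
})

# A comparison yields a boolean Series with one True/False per row
mask = toy_iris["sepal_length"] < 4.8

# Indexing with the mask keeps only the rows where it is True
small = toy_iris[mask]

# Combine conditions with & (and) or | (or); wrap each condition in parentheses
small_setosa = toy_iris[
    (toy_iris["sepal_length"] < 4.8) & (toy_iris["species"] == "setosa")
]
print(small_setosa)
```

The parentheses matter because `&` binds more tightly than `<` and `==` in Python.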


## Insurance dataset examples

Below we import data from a CSV file called 'insurance.csv'. The file is found in the task folder. Make sure it is in the same directory as the notebook.

In [7]:
# Load data
insurance_df = pd.read_csv("insurance.csv")
insurance_df.columns

Index(['age', 'sex', 'bmi', 'children', 'smoker', 'region', 'charges'], dtype='object')

### Grouping in pandas

In [8]:
# Get people aged strictly between 30 and 35 (ages 31 to 34; 30 and 35 themselves are excluded)
between_30_and_35 = insurance_df[(insurance_df['age'] > 30) & (insurance_df['age'] < 35)]

# Print mean charges for these people
print(between_30_and_35['charges'].mean())

10839.408303333334


In [9]:
# Use the query method to get the same rows (ages strictly between 30 and 35)
between_30_and_35 = insurance_df.query("age > 30 and age < 35")

# Print mean charges for these people
print(between_30_and_35['charges'].mean())

10839.408303333334
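
The two cells above return the same rows, and both exclude the endpoints. A small sketch on made-up data (`toy_insurance`, `low` and `high` are hypothetical names) shows the equivalence and, as one option, how `Series.between` could be used if the endpoints should be included:

```python
import pandas as pd

# Invented data for illustration only
toy_insurance = pd.DataFrame({
    "age": [29, 31, 33, 34, 35, 40],
    "charges": [1000.0, 2000.0, 3000.0, 4000.0, 5000.0, 6000.0],
})

# Boolean-mask form: strict inequalities leave out ages 30 and 35
mask_result = toy_insurance[(toy_insurance["age"] > 30) & (toy_insurance["age"] < 35)]

# Equivalent query form; @name refers to a Python variable in scope
low, high = 30, 35
query_result = toy_insurance.query("age > @low and age < @high")

# Series.between includes both endpoints by default
inclusive_result = toy_insurance[toy_insurance["age"].between(low, high)]

print(query_result["charges"].mean())
```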


In [9]:
# Get the mean charges for each age
print(insurance_df.groupby('age')['charges'].mean())

age
18     7086.217556
19     9747.909335
20    10159.697736
21     4730.464330
22    10012.932802
23    12419.820040
24    10648.015962
25     9838.365311
26     6133.825309
27    12184.701721
28     9069.187564
29    10430.158727
30    12719.110358
31    10196.980573
32     9220.300291
33    12351.532987
34    11613.528121
35    11307.182031
36    12204.476138
37    18019.911877
38     8102.733674
39    11778.242945
40    11772.251310
41     9653.745650
42    13061.038669
43    19267.278653
44    15859.396587
45    14830.199856
46    14342.590639
47    17653.999593
48    14632.500445
49    12696.006264
50    15663.003301
51    15682.255867
52    18256.269719
53    16020.930755
54    18758.546475
55    16164.545488
56    15025.515837
57    16447.185250
58    13878.928112
59    18895.869532
60    21979.418507
61    22024.457609
62    19163.856573
63    19884.998461
64    23275.530837
Name: charges, dtype: float64
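
`groupby` splits the rows by the values of a column and then applies a function to each group. The sketch below repeats the pattern on a tiny invented frame (`toy_insurance` is a hypothetical name) and shows two variations that may be useful: several statistics at once, and grouping by more than one column.

```python
import pandas as pd

# Invented data for illustration only
toy_insurance = pd.DataFrame({
    "age": [18, 18, 19, 19, 19],
    "smoker": ["no", "yes", "no", "no", "yes"],
    "charges": [1000.0, 3000.0, 2000.0, 4000.0, 9000.0],
})

# One mean per distinct age; the result is a Series indexed by age
mean_by_age = toy_insurance.groupby("age")["charges"].mean()

# Several statistics at once; agg returns one column per statistic
summary_by_age = toy_insurance.groupby("age")["charges"].agg(["mean", "count"])

# Grouping by two columns gives a result indexed by (age, smoker) pairs
mean_by_age_smoker = toy_insurance.groupby(["age", "smoker"])["charges"].mean()

print(summary_by_age)
```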


## Balance dataset examples

The examples below use data from a text file called 'balance.txt'. The file is found in the task folder. Make sure it is in the same directory as the notebook.

Here is how to view the top rows of the DataFrame. The `head()` method shows the first five observations by default. Use it to get a glimpse of the data, such as the column names and the type of data in each column.

In [10]:
df = pd.read_csv('balance.txt', sep=r'\s+')  # assumption: the file is whitespace-delimited; adjust sep to match it
df.head()

`tail(n)` shows the last `n` observations of the dataset. Here it shows the last seven.

In [None]:
df.tail(7)
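
Both methods accept an optional row count. A minimal sketch on a ten-row invented frame (`toy` is a hypothetical name):

```python
import pandas as pd

# Ten rows with index labels 0 to 9
toy = pd.DataFrame({"value": range(10)})

first_five = toy.head()     # no argument: the first 5 rows
first_three = toy.head(3)   # the first 3 rows
last_seven = toy.tail(7)    # the last 7 rows, labels 3 to 9

print(last_seven)
```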

To get the range of index labels of your dataset, use the syntax `dataset_name.index`. This tells you how to refer to your observations. The `index` attribute below reports a range from 0 to 400, so you cannot select an observation whose label falls outside that range. For example, 450 would not be a valid index for this dataset. Note that if the index is a default `RangeIndex`, its stop value is exclusive, so the last valid label is 399.

In [11]:
df.index

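
As a small illustration of how the index limits which labels are valid, the sketch below uses an invented four-row frame (`toy` is a hypothetical name) with the default index:

```python
import pandas as pd

# Four rows, so the default index is RangeIndex(start=0, stop=4, step=1)
toy = pd.DataFrame({"value": [10, 20, 30, 40]})
print(toy.index)

# The stop value is exclusive: the last valid label is 3, not 4
last_label = toy.index[-1]

# Asking for a label outside the index raises a KeyError
try:
    toy.loc[4]
    found = True
except KeyError:
    found = False

print(last_label, found)
```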

`df.columns` lists the column names in the DataFrame. You will need these names when selecting columns for an analysis and when writing reports based on the dataset.

In [12]:
df.columns


`describe()` shows a quick statistical summary of your data. By default, statistics are only calculated for columns with numerical values.

In [13]:
df.describe()

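
To see the numeric-only default in action, here is a minimal sketch on an invented frame (`toy` and its column values are made up). Passing `include="all"` is one way to summarise the non-numeric columns as well:

```python
import pandas as pd

# One numeric column and one text column, invented for illustration
toy = pd.DataFrame({
    "Income": [10.0, 20.0, 30.0],
    "Gender": ["Male", "Female", "Female"],
})

# Default: only the numeric column is summarised
numeric_summary = toy.describe()

# include="all" adds text columns, with rows such as unique, top and freq
full_summary = toy.describe(include="all")

print(numeric_summary)
print(full_summary)
```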

`sort_values()` sorts the observations by the values in one or more columns, named with the `by` parameter. By default the observations are sorted in ascending order. To sort in descending order, pass `ascending=False`.

In [14]:
df.sort_values(by='Income',ascending=False).head()

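
The sketch below tries the sorting options on an invented frame (`toy` and its values are hypothetical). It also shows sorting by two columns, where the second column breaks ties in the first:

```python
import pandas as pd

# Invented data; two rows share the same Income to show tie-breaking
toy = pd.DataFrame({
    "Income": [50.0, 20.0, 50.0, 80.0],
    "Age": [40, 25, 30, 60],
})

# Ascending is the default
ascending = toy.sort_values(by="Income")

# Descending order
descending = toy.sort_values(by="Income", ascending=False)

# Income descending, then Age ascending among equal incomes
multi = toy.sort_values(by=["Income", "Age"], ascending=[False, True])

print(multi)
```

`sort_values()` returns a new DataFrame and leaves `toy` itself in its original order.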

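Selecting a single column yields a Series. Attribute access such as `df.Rating` works when the column name is a valid Python identifier; `df['Rating']` is equivalent.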



In [15]:
df.Rating.head(5)

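
A short sketch of the different ways to select a column, on an invented frame (`toy` and the column name `Credit Limit` are hypothetical and only serve the illustration):

```python
import pandas as pd

# Invented data; "Credit Limit" is a made-up column name containing a space
toy = pd.DataFrame({"Rating": [100, 200, 300], "Credit Limit": [1000, 2000, 3000]})

as_attribute = toy.Rating        # attribute access returns a Series
as_key = toy["Rating"]           # bracket access returns the same Series
as_frame = toy[["Rating"]]       # a list of names returns a DataFrame

# Names with spaces cannot be used as attributes, so brackets are required
with_space = toy["Credit Limit"]

print(type(as_key).__name__, type(as_frame).__name__)
```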

Selecting via `[]` with a slice selects rows by position. The end of the slice is excluded, so `df[50:60]` returns rows 50 to 59.



In [16]:
df[50:60]


In [17]:
# .loc slices by label and includes both endpoints (labels 40 to 50)
df.loc[40:50]

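
The two slicing cells above treat the end of the range differently. A minimal sketch on an invented 100-row frame (`toy` is a hypothetical name) with the default index makes the difference visible:

```python
import pandas as pd

# 100 rows with index labels 0 to 99
toy = pd.DataFrame({"value": range(100)})

# Plain [] slicing works by position and excludes the end: rows 50 to 59
by_position = toy[50:60]

# .loc slicing works by label and includes the end: labels 40 to 50
by_label = toy.loc[40:50]

print(len(by_position), len(by_label))
```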

#### Selection by Position

`iloc` selects rows and columns by integer position. Inside the brackets, a comma separates the row selector from the column selector, and a colon gives a range. In the example below, `5:8` selects rows 5 to 7 (the end of the range is excluded) and `[1,7]` selects the columns at positions 1 and 7.

In [18]:
df.iloc[5:8,[1,7]]


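
To check which cells `iloc[5:8, [1, 7]]` picks out, the sketch below builds an invented frame (`toy` is a hypothetical name) in which every cell holds `row * 10 + column`, so each value reveals its own position:

```python
import pandas as pd

# 10 rows x 8 columns; the cell at (row, col) holds row * 10 + col
toy = pd.DataFrame(
    [[row * 10 + col for col in range(8)] for row in range(10)],
    columns=list("abcdefgh"),
)

# Rows at positions 5, 6, 7 and columns at positions 1 and 7 ("b" and "h")
subset = toy.iloc[5:8, [1, 7]]

print(subset)
```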

Boolean indexing uses a single column's values to select rows. The example below checks whether there are any users above the age of 90.



In [19]:
df[df.Age > 90]

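
If the goal is only to know whether such users exist, or how many there are, the boolean Series can be summarised directly. A minimal sketch on invented ages (`toy` is a hypothetical name):

```python
import pandas as pd

# Invented ages for illustration only
toy = pd.DataFrame({"Age": [34, 82, 91, 57, 98]})

# Rows where Age is above 90
over_90 = toy[toy.Age > 90]

# True if at least one row matches
any_over_90 = (toy.Age > 90).any()

# True counts as 1, so the sum is the number of matching rows
count_over_90 = (toy.Age > 90).sum()

print(any_over_90, count_over_90)
```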