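The name pandas derives from "panel data", an econometrics term for multidimensional structured datasets, and is also a play on "Python data analysis". The library provides a rich set of features for exploring and manipulating data, which has made it a go-to toolkit for many data scientists.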

In [1]:
import numpy as np
import pandas as pd
print(pd.__version__)

1.1.5


Creating some pandas series...

In [2]:
ser_a = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "d"])
ser_b = pd.Series([1, 2, 3, 4], index=["b", "a", "c", "d"])

ser_a + ser_b

a    3
b    3
c    6
d    8
dtype: int64

... doing some element-wise operations

In [3]:
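# Operands are aligned on index labels, not positions.
# A notebook cell displays only the value of its last expression.
ser_a + ser_b
ser_a - ser_b
ser_a * ser_b
ser_a / ser_b  # true division returns float64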

a    0.5
b    2.0
c    1.0
d    1.0
dtype: float64
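
The results above work because both series carry the same labels. As a minimal sketch of what happens when they do not, the example below uses two made-up series (`ser_x` and `ser_y` are illustrative names, not part of the notebook) whose indexes only partly overlap.

```python
import pandas as pd

# Two illustrative series whose indexes overlap only on "b" and "c".
ser_x = pd.Series([1, 2, 3], index=["a", "b", "c"])
ser_y = pd.Series([10, 20, 30], index=["b", "c", "d"])

# A label present in only one operand yields NaN, and the result becomes float64.
aligned_sum = ser_x + ser_y

# Series.add with fill_value substitutes 0 for a label missing on one side.
filled_sum = ser_x.add(ser_y, fill_value=0)

print(aligned_sum)
print(filled_sum)
```

The operator form is the shorter spelling; the method form (`add`, `sub`, `mul`, `div`) is useful when a missing label should count as a neutral value rather than propagate NaN.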

... doing some aggregate operations

In [4]:
ser_c = pd.Series([1, np.nan, 3, 4], index=["a", "b", "c", "d"])
print(ser_c.dtype)        # What dtype does `ser_c` have? (NaN forces float64)

ser_c.count()             # => 3 (NaN is not counted)
ser_c.sum()               # => 8.0 (NaN is skipped by default)
ser_c.mean()              # => 2.67 (8.0 / 3)
ser_c.mean(skipna=False)  # => nan
ser_c.max()               # => 4.0
ser_c.min()               # => 1.0
ser_c.idxmax()            # => "d"

ser_d = pd.Series([1, "a", 3, 4], index=["a", "b", "c", "d"])
print(ser_d.dtype)        # What dtype does `ser_d` have? (mixed types give object)

ser_e = pd.Series([1, 1, 1, np.nan, 3, 4])
ser_e

float64
object


0    1.0
1    1.0
2    1.0
3    NaN
4    3.0
5    4.0
dtype: float64
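
The aggregations above skip NaN by default, and a single NaN is enough to turn an integer series into float64. The sketch below makes both effects explicit on a small throwaway series. The `"Int64"` nullable dtype at the end is an optional alternative and assumes a pandas version that ships it (it is available in the 1.x line used here).

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, np.nan, 3, 4])

total = ser.sum()                     # NaN is skipped: 1 + 3 + 4
total_strict = ser.sum(skipna=False)  # NaN propagates into the result
n_valid = ser.count()                 # number of non-missing values
n_all = len(ser)                      # number of entries, missing or not

# The nullable integer dtype keeps whole numbers as integers despite the gap.
ser_int = pd.Series([1, None, 3, 4], dtype="Int64")

print(total, total_strict, n_valid, n_all)
print(ser_int.dtype)
```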

Checking unique values

In [5]:
ser_e.unique() # => [ 1., nan,  3.,  4.]
ser_e.nunique() # => 3
ser_e.value_counts()

1.0    3
4.0    1
3.0    1
dtype: int64
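
Note that `unique()` reports NaN while `nunique()` and `value_counts()` leave it out by default. A short sketch of the keyword arguments that change this, rebuilt on the same values as `ser_e`:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 1, 1, np.nan, 3, 4])

# dropna=False gives NaN its own row in the counts.
counts_with_nan = ser.value_counts(dropna=False)

# normalize=True returns shares of the non-missing values instead of raw counts.
shares = ser.value_counts(normalize=True)

# nunique can also be told to treat NaN as a distinct value.
n_unique_with_nan = ser.nunique(dropna=False)

print(counts_with_nan)
print(shares)
print(n_unique_with_nan)
```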

Checking null values

In [6]:
ser_e.isna()

0    False
1    False
2    False
3     True
4    False
5    False
dtype: bool

In [7]:
ser_e.notna()

0     True
1     True
2     True
3    False
4     True
5     True
dtype: bool

In [8]:
ser_e.dropna()

0    1.0
1    1.0
2    1.0
4    3.0
5    4.0
dtype: float64

In [9]:
ser_e.fillna(ser_e.mean())

0    1.0
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
dtype: float64

In [10]:
ser_e.ffill()  # forward fill; same result as fillna(method="ffill"), which newer pandas releases deprecate

0    1.0
1    1.0
2    1.0
3    1.0
4    3.0
5    4.0
dtype: float64

In [11]:
ser_e.bfill()  # backward fill; same result as fillna(method="bfill"), which newer pandas releases deprecate

0    1.0
1    1.0
2    1.0
3    3.0
4    3.0
5    4.0
dtype: float64
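
Mean, forward fill and backward fill are not the only options. As a hedged sketch of two further strategies, linear interpolation and a median fill, on the same values as `ser_e` (which one is appropriate depends on the data and is not something this notebook settles):

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, 1, 1, np.nan, 3, 4])

# Linear interpolation places the gap halfway between its neighbours (1.0 and 3.0).
interpolated = ser.interpolate()

# The median is less sensitive to outliers than the mean.
median_filled = ser.fillna(ser.median())

# Both calls return new series; the original still has its gap.
still_missing = ser.isna().sum()

print(interpolated)
print(median_filled)
print(still_missing)
```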

Creating pandas dataframes

In [12]:
# Create a DataFrame of random integers in [0, 9]: 5 rows, 10 columns
import random
random.seed(3)  # fix the seed so the values are reproducible
df = pd.DataFrame([[random.randint(0, 9) for _ in range(10)] for _ in range(5)],
                  index=list(range(5)),
                  columns=list('abcdefghij'))
df

   a  b  c  d  e  f  g  h  i  j
0  3  9  8  2  5  9  7  9  1  9
1  0  7  4  8  3  3  7  8  8  7
2  6  2  3  2  8  6  0  1  2  9
3  0  4  0  4  7  9  6  6  6  9
4  7  2  5  1  0  2  7  3  4  6
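
A nested list is one of several ways to build a DataFrame. The sketch below shows two common alternatives: a dict of columns and a NumPy array. The names `df_small` and `df_rand` are illustrative, and NumPy's generator produces different numbers from `random.seed(3)`, so the values will not match the table above.

```python
import numpy as np
import pandas as pd

# From a dict: keys become column names, values become the columns.
df_small = pd.DataFrame({"a": [3, 0, 6], "b": [9, 7, 2]})

# From a NumPy array: integers(0, 10) draws from 0..9 (the upper bound is exclusive).
rng = np.random.default_rng(3)
df_rand = pd.DataFrame(rng.integers(0, 10, size=(5, 10)),
                       columns=list("abcdefghij"))

print(df_small)
print(df_rand.shape)
```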


Doing some operations with pandas dataframes

In [13]:
# Select a column (i.e., a series)
df['a']

# Add another column
df['k'] = df['a'] * df['b']

# Get the headers (i.e., the column names)
df.columns

# Get just the first two rows
df.head(2)

# Get just the last two rows
df.tail(2)

# Sort by column 'a' (descending), then by 'b' (ascending); returns a new DataFrame
df.sort_values(by=['a', 'b'], ascending=[False, True])

# To get some statistics (e.g., count, mean, std, min, etc.)
df.describe()

              a         b         c         d         e         f         g         h         i         j          k
count       5.0       5.0       5.0       5.0       5.0       5.0       5.0       5.0       5.0       5.0        5.0
mean        3.2       4.8       4.0       3.4       4.6       5.8       5.4       5.4       4.2       8.0       10.6
std    3.271085  3.114482  2.915476  2.792848  3.209361  3.271085   3.04959  3.361547  2.863564  1.414214  11.260551
min         0.0       2.0       0.0       1.0       0.0       2.0       0.0       1.0       1.0       6.0        0.0
25%         0.0       2.0       3.0       2.0       3.0       3.0       6.0       3.0       2.0       7.0        0.0
50%         3.0       4.0       4.0       2.0       5.0       6.0       7.0       6.0       4.0       9.0       12.0
75%         6.0       7.0       5.0       4.0       7.0       9.0       7.0       8.0       6.0       9.0       14.0
max         7.0       9.0       8.0       8.0       8.0       9.0       7.0       9.0       8.0       9.0       27.0
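
Beyond the operations above, rows are usually picked with a boolean mask or with `loc`/`iloc`. The following sketch reuses the `a` and `b` values from the random frame in a small stand-in called `df_demo` (an illustrative name), and also shows that `sort_values` leaves the original frame untouched.

```python
import pandas as pd

df_demo = pd.DataFrame({"a": [3, 0, 6, 0, 7], "b": [9, 7, 2, 4, 2]})
df_demo["k"] = df_demo["a"] * df_demo["b"]

# Boolean mask: keep only the rows where column 'a' is positive.
positive_a = df_demo[df_demo["a"] > 0]

# loc selects by label, iloc by integer position.
cell = df_demo.loc[2, "k"]
first_row = df_demo.iloc[0]

# sort_values returns a sorted copy; df_demo keeps its original row order.
sorted_demo = df_demo.sort_values(by=["a", "b"], ascending=[False, True])

print(positive_a)
print(cell)
print(sorted_demo)
```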


Importing data from a CSV file.
See also: https://www.kaggle.com/camnugent/california-housing-prices

The data contains information from the 1990 California census. So although it may not help you with predicting current housing prices like the Zillow Zestimate dataset, it does provide an accessible introductory dataset for teaching people about the basics of machine learning.

From https://www.oreilly.com/library/view/hands-on-machine-learning/9781492032632/:
"This data has metrics such as the population, median income, median housing price, and so on for each block group in California. Block groups are the smallest geographical unit for which the US Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). We will just call them “districts” for short. Your model should learn from this data and be able to predict the median housing price in any district, given all the other metrics."

In [14]:
df = pd.read_csv("../datasets/housing.csv", delimiter=",")
df

       longitude  latitude  housing_median_age  total_rooms  total_bedrooms  population  households  median_income  median_house_value  ocean_proximity
0        -122.23     37.88                41.0        880.0           129.0       322.0       126.0         8.3252            452600.0         NEAR BAY
1        -122.22     37.86                21.0       7099.0          1106.0      2401.0      1138.0         8.3014            358500.0         NEAR BAY
2        -122.24     37.85                52.0       1467.0           190.0       496.0       177.0         7.2574            352100.0         NEAR BAY
3        -122.25     37.85                52.0       1274.0           235.0       558.0       219.0         5.6431            341300.0         NEAR BAY
4        -122.25     37.85                52.0       1627.0           280.0       565.0       259.0         3.8462            342200.0         NEAR BAY
...          ...       ...                 ...          ...             ...         ...         ...            ...                 ...              ...
20635    -121.09     39.48                25.0       1665.0           374.0       845.0       330.0         1.5603             78100.0           INLAND
20636    -121.21     39.49                18.0        697.0           150.0       356.0       114.0         2.5568             77100.0           INLAND
20637    -121.22     39.43                17.0       2254.0           485.0      1007.0       433.0         1.7000             92300.0           INLAND
20638    -121.32     39.43                18.0       1860.0           409.0       741.0       349.0         1.8672             84700.0           INLAND
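
`read_csv` accepts any file-like object, which makes it easy to try its options without the dataset on disk. The sketch below feeds it three rows copied from the output above through `io.StringIO`; it assumes the real file's header uses the column names shown there. `usecols` is one of several optional parameters and limits parsing to the listed columns.

```python
import io
import pandas as pd

# Three rows copied from the displayed output, in CSV form (header assumed).
csv_text = """longitude,latitude,housing_median_age,total_rooms,total_bedrooms,population,households,median_income,median_house_value,ocean_proximity
-122.23,37.88,41.0,880.0,129.0,322.0,126.0,8.3252,452600.0,NEAR BAY
-122.22,37.86,21.0,7099.0,1106.0,2401.0,1138.0,8.3014,358500.0,NEAR BAY
-121.09,39.48,25.0,1665.0,374.0,845.0,330.0,1.5603,78100.0,INLAND
"""

# Parse only the columns of interest; the rest are skipped.
housing_sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["median_income", "median_house_value", "ocean_proximity"],
)

print(housing_sample)
print(housing_sample.dtypes)
```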


In [15]:
df.describe()

         longitude   latitude  housing_median_age  total_rooms  total_bedrooms   population  households  median_income  median_house_value
count      20640.0    20640.0             20640.0      20640.0         20433.0      20640.0     20640.0        20640.0             20640.0
mean   -119.569704  35.631861           28.639486  2635.763081      537.870553  1425.476744   499.53968       3.870671       206855.816909
std       2.003532   2.135952           12.585558  2181.615252       421.38507  1132.462122  382.329753       1.899822       115395.615874
min        -124.35      32.54                 1.0          2.0             1.0          3.0         1.0         0.4999             14999.0
25%         -121.8      33.93                18.0      1447.75           296.0        787.0       280.0         2.5634            119600.0
50%        -118.49      34.26                29.0       2127.0           435.0       1166.0       409.0         3.5348            179700.0
75%        -118.01      37.71                37.0       3148.0           647.0       1725.0       605.0        4.74325            264725.0
max        -114.31      41.95                52.0      39320.0          6445.0      35682.0      6082.0        15.0001            500001.0
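
In the summary above, `total_bedrooms` has a count of 20433 while every other column has 20640, so 207 rows are missing a value there. `describe()` also omits the text column `ocean_proximity`. The sketch below shows how both can be inspected directly. It runs on a tiny made-up frame (`sample`, with one NaN inserted for illustration) rather than the real file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the housing data; the NaN and the category mix are invented.
sample = pd.DataFrame({
    "total_rooms": [880.0, 7099.0, 1467.0, 1274.0],
    "total_bedrooms": [129.0, np.nan, 190.0, 235.0],
    "ocean_proximity": ["NEAR BAY", "NEAR BAY", "NEAR BAY", "INLAND"],
})

# Missing values per column; this matches the gap in describe()'s count row.
missing_per_column = sample.isna().sum()
bedrooms_count = sample.describe().loc["count", "total_bedrooms"]

# Frequency table for the categorical column that describe() skips by default.
category_counts = sample["ocean_proximity"].value_counts()

print(missing_per_column)
print(bedrooms_count)
print(category_counts)
```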
