# Part 5: Data Types and Missing Values

In [1]:
import pandas as pd
reviews = pd.read_csv("./resources/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
pd.set_option("display.max_rows", 5)

## DTypes
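- The data type of a column in a <mark>DataFrame</mark> or <mark>Series</mark> is called its <mark>dtype</mark>
- The <mark>dtype</mark> property returns the type of a single column; the <mark>dtypes</mark> property returns the type of every column in a <mark>DataFrame</mark>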

In [3]:
reviews.price.dtype

dtype('float64')

In [4]:
reviews.dtypes

country        object
description    object
                ...  
variety        object
winery         object
Length: 13, dtype: object

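- Columns consisting entirely of strings do not get a string type of their own; they are given the <mark>object</mark> type
- <mark>astype()</mark> converts a column to another type wherever the conversion makes sense, and returns a new column rather than changing the original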

In [5]:
reviews.points.astype('float64')

0         87.0
1         87.0
          ... 
129969    90.0
129970    90.0
Name: points, Length: 129971, dtype: float64

- The index of a <mark>DataFrame</mark> or <mark>Series</mark> has its own <mark>dtype</mark> too

In [6]:
reviews.index.dtype

dtype('int64')
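
The dtype ideas above can be tried on a small hand-made frame. This is only a minimal sketch: the `toy` DataFrame and its values are made up for illustration and are not taken from the wine-reviews data.

```python
import pandas as pd

# Hypothetical toy data, not taken from the wine-reviews file
toy = pd.DataFrame({
    "country": ["Italy", "Portugal", "US"],
    "points": [87, 87, 90],
    "price": [15.0, None, 14.0],
})

# dtype is a property (no parentheses); dtypes covers every column
points_dtype = toy.points.dtype
all_dtypes = toy.dtypes

# astype() returns a converted copy; the original column keeps its dtype
points_as_float = toy.points.astype("float64")

# The index has a dtype of its own
index_dtype = toy.index.dtype

print(points_dtype, points_as_float.dtype, toy.points.dtype, index_dtype)
```

To keep a converted column, assign the result back, for example `toy["points"] = toy.points.astype("float64")`.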

## Missing Data
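- Missing entries are given the value <mark>NaN</mark>, short for "Not a Number"; <mark>NaN</mark> values are always of type <mark>float64</mark>
- <mark>pd.isnull()</mark> selects the entries that are <mark>NaN</mark>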

In [7]:
reviews[pd.isnull(reviews.country)]

| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 913 | NaN | Amber in color, this wine has aromas of peach ... | Asureti Valley | 87 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Gotsa Family Wines 2014 Asureti Valley Chinuri | Chinuri | Gotsa Family Wines |
| 3131 | NaN | Soft, fruity and juicy, this is a pleasant, si... | Partager | 83 | NaN | NaN | NaN | NaN | Roger Voss | @vossroger | Barton & Guestier NV Partager Red | Red Blend | Barton & Guestier |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129590 | NaN | A blend of 60% Syrah, 30% Cabernet Sauvignon a... | Shah | 90 | 30.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Büyülübağ 2012 Shah Red | Red Blend | Büyülübağ |
| 129900 | NaN | This wine offers a delightful bouquet of black... | NaN | 91 | 32.0 | NaN | NaN | NaN | Mike DeSimone | @worldwineguys | Psagot 2014 Merlot | Merlot | Psagot |
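
The same selection can be checked on a few rows by hand. The sketch below assumes a made-up `toy` frame (not the wine-reviews data) and shows `pd.isnull()`, its companion `pd.notnull()`, and a per-column count of missing entries.

```python
import pandas as pd

# Hypothetical toy data with one missing country and one missing price
toy = pd.DataFrame({
    "country": ["Italy", None, "US"],
    "price": [15.0, None, 14.0],
})

# Boolean mask: True where country is missing
missing_country = toy[pd.isnull(toy.country)]

# pd.notnull() is the complement and keeps the rows that have a value
known_country = toy[pd.notnull(toy.country)]

# Summing the mask counts the missing entries in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```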


- <mark>fillna()</mark> replaces missing values, for example with a fixed placeholder such as <mark>"Unknown"</mark>

In [8]:
reviews.region_2.fillna("Unknown")

0         Unknown
1         Unknown
           ...   
129969    Unknown
129970    Unknown
Name: region_2, Length: 129971, dtype: object
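
One detail worth seeing on a small example is that `fillna()` returns a new Series and leaves the original alone. The `region` Series below is made up for illustration.

```python
import pandas as pd

# Hypothetical column with two missing entries
region = pd.Series(["Napa", None, "Etna", None], name="region_2")

# fillna() returns a filled copy
filled = region.fillna("Unknown")

# The original still contains its missing values
still_missing = region.isnull().sum()

print(filled.tolist(), still_missing)
```

To keep the filled values, assign the result back to the column.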

- The <mark>backfill</mark> strategy fills each missing value with the first non-null value that follows it in the data
- <mark>replace()</mark> substitutes one non-null value for another

In [10]:
reviews.taster_twitter_handle.replace('@kerinokeefe', '@kerino')

0            @kerino
1         @vossroger
             ...    
129969    @vossroger
129970    @vossroger
Name: taster_twitter_handle, Length: 129971, dtype: object

- <mark>replace()</mark> is also handy when missing data is recorded as a sentinel value such as <mark>"Unknown"</mark>, <mark>"Undisclosed"</mark>, or <mark>"Invalid"</mark> instead of <mark>NaN</mark>
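
A rough sketch of the backfill strategy and of sentinel replacement follows. Both Series are made up for illustration, and `bfill()` is used here as one way to apply backfilling; the notes above do not say which call was intended.

```python
import numpy as np
import pandas as pd

# Hypothetical column with gaps, including one at the very end
s = pd.Series(["Napa", None, None, "Etna", None])

# Backfill: each gap takes the next non-null value below it.
# The trailing gap stays missing because nothing follows it.
backfilled = s.bfill()

# Hypothetical column where missing data was recorded as sentinel strings
handles = pd.Series(["@vossroger", "Unknown", "@kerinokeefe", "Undisclosed"])

# Turn the sentinels into real missing values so isnull()/fillna() see them
cleaned = handles.replace(["Unknown", "Undisclosed"], np.nan)

print(backfilled.tolist())
print(cleaned.isnull().sum())
```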

## Exercises

### 1

In [12]:
dtype = reviews.points.dtype
dtype

dtype('int64')

### 2

In [13]:
point_strings = reviews.points.astype('str')
point_strings

0         87
1         87
          ..
129969    90
129970    90
Name: points, Length: 129971, dtype: object

### 3

In [32]:
n_missing_prices = reviews.price.isnull().sum()
n_missing_prices

8996

### 4

In [42]:
reviews_per_region = reviews.region_1.fillna("Unknown").value_counts().sort_values(ascending=False)
reviews_per_region

region_1
Unknown        21247
Napa Valley     4480
               ...  
Geelong            1
Paestum            1
Name: count, Length: 1230, dtype: int64