# ⭐ Getting to grips with testing

There are different levels of testing:
- Assertions: 🦄 == 🦄
- Exceptions: (within the code) serve as warnings ⚠️
- Unit tests: investigate the behaviour of units of code (e.g. functions)
- Regression tests: defend against 🐛
- Integration tests: ⚙️ check that the pieces work together as expected


## Assertions
An assertion evaluates an expression that is expected to be true. If the expression turns out to be false, Python raises an exception of the type `AssertionError`.
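A minimal sketch of that behaviour, using a hypothetical helper `check_positive` that is not part of the original notebook: the assertion passes silently when the expression is true and raises `AssertionError`, carrying the optional message, when it is false.

```python
def check_positive(value):
    # Hypothetical helper: the text after the comma becomes the error message.
    assert value > 0, f"expected a positive number, got {value}"
    return value


check_positive(3)  # passes silently

try:
    check_positive(-1)
except AssertionError as error:
    message = str(error)
    print("AssertionError:", message)
```

Python skips `assert` statements when it is run with the `-O` flag, so assertions suit tests and sanity checks rather than validation of user input.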

In [None]:
# Basic example: function to identify if a given number is a multiple of 5 

In [None]:
def isMultiple(number):
    is_fact = False
    
    if number % 5 == 0:
        is_fact = True
    return is_fact

print(isMultiple(15))

It works, so now it is your problem, right?

<img src="./assets/devproblem.jpeg">

Not quite!

In [None]:
print(isMultiple(0))

In [None]:
import pytest

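def test_isMultiple():
    # Positive multiples of 5 should be accepted
    assert isMultiple(5) is True
    assert isMultiple(50) is True
    # Zero and negative numbers should be rejected
    assert isMultiple(0) is False
    assert isMultiple(-10) is False

# Calling the test directly raises AssertionError if any check fails
test_isMultiple()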

In [None]:
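def isMultiple(number):
    """Return True only for positive multiples of 5."""
    is_fact = False

    # Reject zero and negative numbers before checking divisibility
    if number > 0 and number % 5 == 0:
        is_fact = True
    return is_fact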

print(isMultiple(15))

In [None]:
test_isMultiple()

## Exceptions
Catch bugs before they are actually bugs. 

Imagine we want to load a data set...

In [None]:
import pandas as pd

df =  pd.read_csv('winemag-data-130k-v2.csv')


In [None]:
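try:
    # Prefer the local copy of the data set
    df = pd.read_csv('winemag-data-130k-v2.csv', index_col=0)
except FileNotFoundError:
    # Fall back to the remote copy if the local file is missing
    df = pd.read_csv('https://raw.githubusercontent.com/trallard/TestingData/master/data/winemag-data-130k-v2.csv',
                     index_col=0)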
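The same fallback pattern can be sketched without the wine data set. The example below is only an illustration: `read_first_available` and the file names are hypothetical, and it uses the standard-library `csv` module in place of pandas so that it runs anywhere. Catching `FileNotFoundError` specifically, rather than every exception, lets unrelated errors surface.

```python
import csv
import os
import tempfile


def read_first_available(paths):
    """Return the rows of the first CSV file in `paths` that exists."""
    for path in paths:
        try:
            with open(path, newline="") as handle:
                return list(csv.DictReader(handle))
        except FileNotFoundError:
            # Only a missing file triggers the fallback; other errors propagate.
            continue
    raise FileNotFoundError(f"none of these files exist: {paths}")


with tempfile.TemporaryDirectory() as folder:
    backup = os.path.join(folder, "backup.csv")
    with open(backup, "w", newline="") as handle:
        handle.write("country,price\nPortugal,15\n")

    missing = os.path.join(folder, "missing.csv")
    rows = read_first_available([missing, backup])

print(rows)
```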

In [None]:
df.head()

# ⭐ Data validation

## Schema validation
There are a number of Python libraries to achieve this: [Schema](https://github.com/keleshev/schema), [Voluptuous](https://github.com/alecthomas/voluptuous) and [Cerberus](http://docs.python-cerberus.org/en/stable/) are some of the most commonly used. 

We will use Voluptuous. Start by running `pip install voluptuous`.

What does schema validation mean?
It refers to checking that all the expected fields are present and that every value has the right type, or can at least be parsed into it.
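To make the idea concrete, here is a hand-rolled sketch of such a check. It is not how Voluptuous works internally; `validate_record` and the example fields are hypothetical names chosen for illustration.

```python
def validate_record(record, schema):
    """Return a list of problems found when comparing `record` to `schema`.

    `schema` maps each required field name to the type its value must
    have, or at least be parseable as.
    """
    problems = []
    for field, expected_type in schema.items():
        if field not in record:
            problems.append(f"missing field: {field}")
            continue
        value = record[field]
        if isinstance(value, expected_type):
            continue
        try:
            expected_type(value)  # is the value at least parseable?
        except (TypeError, ValueError):
            problems.append(
                f"{field}: expected {expected_type.__name__}, got {value!r}"
            )
    return problems


schema = {"country": str, "price": float}

ok = validate_record({"country": "Portugal", "price": "15.0"}, schema)
missing_field = validate_record({"country": "Portugal"}, schema)
bad_type = validate_record({"country": "Portugal", "price": "cheap"}, schema)

print(ok)
print(missing_field)
print(bad_type)
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a record at once.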

**Let's start by checking the data types of our columns**

In [None]:
df.dtypes

We can now define a toy example of a given schema

In [None]:
from voluptuous import Schema

s = Schema({
    'q': str,
    'per_page': int,
    'page': int,
})

s({"q": "hello"})

In [None]:
s({"q": "hello", "page": "world"})

Now let's generate a schema for our own data set

In [None]:
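# extra=True lets columns that are not listed in the schema pass through
df_schema = Schema({
    'country': str,
    'description': str,
    'designation': str,
    'price': int,  # type checks use isinstance, so a float such as 15.0 is rejected
    'province': str,
    'taster_name': str,
}, extra=True)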

We will validate just one small sample of the data: a single row.

In [None]:
df_sample = dict(df.loc[1])
print(df_sample)

In [None]:
df_schema(df_sample)

## Check for missing values and duplicates

Let's start with the duplicates:

In [None]:
print("Total number of examples: ", df.shape[0])
print("Number of examples with the same title and description: ", 
      df[df.duplicated(['description','title'])].shape[0])

duplicated_rows = df[df.duplicated(['description'])].shape[0]
print("Number of examples with the same description: ", duplicated_rows  )

I am going to create a copy of the dataframe to do my manipulations... starting with dropping the duplicates.

In [None]:
interim = df.copy()
interim.drop_duplicates(subset = 'description', inplace = True)

no_rows_interim = len(interim)
print('Total unique reviews:', no_rows_interim)
print('\nVariety description \n', interim['variety'].describe())

### What if I want to do some checks on my data?

Let's check if our dataframe has the number of rows it is expected to have

In [None]:
assert no_rows_interim == (len(df) - duplicated_rows)
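The check above can be tried on a small frame. This is a sketch with hypothetical toy values standing in for the wine reviews; the invariant is the same: dropping duplicates should remove exactly the rows that `duplicated` flags.

```python
import pandas as pd

# Toy data standing in for the wine reviews (hypothetical values).
reviews = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "description": ["crisp", "crisp", "oaky", "fruity"],
})

# duplicated() marks every repeat after the first occurrence.
n_duplicates = int(reviews.duplicated(subset=["description"]).sum())
unique_reviews = reviews.drop_duplicates(subset="description")

# Invariant: the row count shrinks by exactly the number of flagged rows.
assert len(unique_reviews) == len(reviews) - n_duplicates
print(n_duplicates, len(unique_reviews))
```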

Now, let's have a look at our missing values:

In [None]:
total = df.isnull().sum().sort_values(ascending = False)
percent = (df.isnull().sum()/df.isnull().count()*100).sort_values(ascending = False)
missing_data  = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data
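A compact way to build the same kind of summary, sketched on a hypothetical toy frame: `isnull().mean()` gives the fraction of missing entries per column, which is equivalent to dividing the sum by the count.

```python
import pandas as pd

# Hypothetical toy frame with a few missing entries.
frame = pd.DataFrame({
    "price": [15.0, None, 14.0, None],
    "country": ["Portugal", "US", None, "Italy"],
    "points": [87, 87, 87, 87],
})

missing = pd.DataFrame({
    "Total": frame.isnull().sum(),
    "Percent": frame.isnull().mean() * 100,
}).sort_values("Total", ascending=False)

print(missing)
```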

Most of the missing values are in the region, designation, taster name and price columns.

I'm most worried about wines with NaN in the price column. We don't want to predict points for wines whose price is undeclared, so we will drop the rows with a NaN value in this column.

In [None]:
interim = interim.dropna(subset=['price'])
interim = interim.reset_index(drop = True)

interim.head()

In [None]:
interim.shape

Doing more data transformations... we only want to keep the 20 most common wine varieties

In [None]:
varieties = interim['variety'].value_counts()
varieties

In [None]:
top_wines_df = interim.loc[interim['variety'].isin(varieties.axes[0][:20])].to_json

I have been very careful with all my data manipulation right?

![](./assets/data.jpeg)


In [None]:
assert isinstance(interim, pd.DataFrame)
assert isinstance(top_wines_df, pd.DataFrame)
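If the second assertion fails, one likely cause is that `.to_json` was assigned without being called, so `top_wines_df` holds a method rather than a data frame. Below is a sketch of a selection that keeps a DataFrame, on hypothetical toy data and keeping the top 2 varieties instead of 20.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews.
wines = pd.DataFrame({
    "variety": ["Pinot Noir", "Riesling", "Pinot Noir",
                "Merlot", "Riesling", "Pinot Noir"],
    "points": [90, 88, 91, 85, 87, 92],
})

counts = wines["variety"].value_counts()  # sorted from most to least common
top_varieties = counts.index[:2]

# Boolean indexing returns a DataFrame; there is no .to_json at the end.
top_wines = wines.loc[wines["variety"].isin(top_varieties)]

assert isinstance(top_wines, pd.DataFrame)
print(sorted(top_wines["variety"].unique()))
```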

Assume you have a **test data frame** (a fixture) and you want to verify that your newly processed data conforms to it. Pandas has its own testing module, `pandas.testing`.

In [None]:
fixture = pd.read_csv('./data/fixture.csv', index_col=None)

In [None]:
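# Compare the processed data against the fixture (values, dtypes and index)
pd.testing.assert_frame_equal(top_wines_df, fixture)
# assert_index_equal expects Index objects, not DataFrames
pd.testing.assert_index_equal(top_wines_df.index, fixture.index)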

In [None]:
top_wines_df.reset_index(drop =True, inplace=True)

In [None]:
pd.testing.assert_index_equal(top_wines_df.index, fixture.index)
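Why resetting the index matters can be seen on a small example. This is a sketch with hypothetical frames: filtering rows keeps the original index labels, so a comparison against a freshly loaded fixture fails until the index is reset.

```python
import pandas as pd

# Hypothetical fixture and raw data.
fixture = pd.DataFrame({"variety": ["Merlot", "Riesling"],
                        "price": [20.0, 15.0]})
raw = pd.DataFrame({"variety": ["Merlot", "Syrah", "Riesling"],
                    "price": [20.0, None, 15.0]})

processed = raw.dropna(subset=["price"])  # keeps the original labels 0 and 2

try:
    pd.testing.assert_frame_equal(processed, fixture)
    index_matched = True
except AssertionError:
    index_matched = False  # the values agree, but the index labels do not

# After resetting the index both comparisons pass.
processed = processed.reset_index(drop=True)
pd.testing.assert_index_equal(processed.index, fixture.index)
pd.testing.assert_frame_equal(processed, fixture)
print("index matched before reset:", index_matched)
```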

# Some property-based testing with Hypothesis