# Q1. Refactoring

Now we need to create the "main" block from which we'll invoke the main function. What does the if statement that we use for this look like?

Answer:    
    
`if __name__ == '__main__':`
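
As a minimal sketch of how this block might sit at the bottom of `batch.py`: the body of `main`, the `parse_args` helper, and the 2021-01 defaults are assumptions for illustration only. The positional year and month arguments mirror the `python batch.py 2021 1` call used in the integration test in this document.

```python
import sys


def main(year, month):
    # Placeholder for the real batch logic (reading data, predicting, saving).
    return f'processed {year:04d}-{month:02d}'


def parse_args(argv):
    # argv[0] is the script name; default to 2021-01 when arguments are missing.
    year = int(argv[1]) if len(argv) > 1 else 2021
    month = int(argv[2]) if len(argv) > 2 else 1
    return year, month


if __name__ == '__main__':
    # Runs only when the file is executed directly, not when it is imported.
    year, month = parse_args(sys.argv)
    print(main(year, month))
```

Because of the guard, a test module can `import batch` without triggering the batch run.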

# Q2. Installing pytest

Next, create a folder `tests` and create two files in it. One will be the file with tests; we can name it `test_batch.py`.
What should the other file be?

Answer:   
    
`__init__.py`

# Q3. Writing first unit test

How many rows should there be in the expected dataframe?

Answer:    
 `2`

Code:

```python
import os
import sys

# Add the current working directory to the import path so that `import batch` resolves
sys.path.insert(1, os.path.abspath('.'))

import batch
from datetime import datetime
import pandas as pd
from deepdiff import DeepDiff

def dt(hour, minute, second=0):
    return datetime(2021, 1, 1, hour, minute, second)

def test_prepare_data():
    data = [
        (None, None, dt(1, 2), dt(1, 10)),      # missing location IDs, 8 minutes
        (1, 1, dt(1, 2), dt(1, 10)),            # 8 minutes
        (1, 1, dt(1, 2, 0), dt(1, 2, 50)),      # 50 seconds
        (1, 1, dt(1, 2, 0), dt(2, 2, 1)),       # 1 hour and 1 second
    ]

    columns = ['PUlocationID', 'DOlocationID', 'pickup_datetime', 'dropOff_datetime']
    categorical = ['PUlocationID', 'DOlocationID']

    df = pd.DataFrame(data, columns=columns)
    actual_dict = batch.prepare_data(df, categorical).to_dict(orient='records')
    expected_dict = [{'PUlocationID': '-1', 'DOlocationID': '-1', 'pickup_datetime': pd.Timestamp(dt(1, 2)), 'dropOff_datetime': dt(1, 10), 'duration': 8.0}, 
                     {'PUlocationID':  '1', 'DOlocationID':  '1', 'pickup_datetime': pd.Timestamp(dt(1, 2)), 'dropOff_datetime': dt(1, 10), 'duration': 8.0}]

    # significant_digits=1 compares floats after rounding to one decimal place
    diff = DeepDiff(actual_dict, expected_dict, ignore_order=True, significant_digits=1)
    print(diff)
    assert 'type_changes' not in diff
    assert 'values_changed' not in diff
```
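
The answer of 2 can be checked by hand from the trip durations in the test data above. The sketch below uses only the standard library and assumes that `prepare_data` keeps trips lasting between 1 and 60 minutes inclusive; that filter is an assumption about `batch.py`, which is not shown here.

```python
from datetime import datetime


def dt(hour, minute, second=0):
    return datetime(2021, 1, 1, hour, minute, second)


# (pickup, dropoff) pairs from the four test rows
trips = [
    (dt(1, 2), dt(1, 10)),
    (dt(1, 2), dt(1, 10)),
    (dt(1, 2, 0), dt(1, 2, 50)),
    (dt(1, 2, 0), dt(2, 2, 1)),
]

# Duration of each trip in minutes
durations = [(dropoff - pickup).total_seconds() / 60 for pickup, dropoff in trips]

# Assumed filter: keep trips between 1 and 60 minutes inclusive
kept = [d for d in durations if 1 <= d <= 60]

print(durations)
print(len(kept))
```

Under that assumption the 50-second trip is too short and the trip of 1 hour and 1 second is too long, which leaves the two 8-minute rows.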

# Q4. Mocking S3 with Localstack

Adjust it for localstack. What does the command look like?

Answer:     
`aws --endpoint-url http://localhost:4566 s3 mb s3://nyc-duration`

**docker-compose.yml:**

```yaml
services:
  localstack:
    container_name: hw6
    image: localstack/localstack
    ports:
      - "4566:4566"
    environment:
      - SERVICES=s3
```
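
If the bucket has to be created from a Python setup script rather than by hand, the same command can be assembled as an argument list. This is only a sketch: `localstack_aws_command` is a hypothetical helper, and the default endpoint assumes the port mapping from the compose file above.

```python
import shlex


def localstack_aws_command(*args, endpoint_url='http://localhost:4566'):
    # Prefix any AWS CLI arguments with the Localstack endpoint override.
    return ['aws', '--endpoint-url', endpoint_url, *args]


cmd = localstack_aws_command('s3', 'mb', 's3://nyc-duration')
print(shlex.join(cmd))

# To actually run it (requires the AWS CLI and a running Localstack container):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting issues when it is handed to `subprocess.run`.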


# Q5. Creating test data

What's the size of the file?

Answer: 
`3512`

**Code for saving dataframe to S3:**

```python
import pandas as pd
from datetime import datetime
import os

def dt(hour, minute, second=0):
    return datetime(2021, 1, 1, hour, minute, second)

data = [
    (None, None, dt(1, 2), dt(1, 10)),
    (1, 1, dt(1, 2), dt(1, 10)),
    (1, 1, dt(1, 2, 0), dt(1, 2, 50)),
    (1, 1, dt(1, 2, 0), dt(2, 2, 1)),
]

columns = ['PUlocationID', 'DOlocationID', 'pickup_datetime', 'dropOff_datetime']


df_input = pd.DataFrame(data, columns=columns)

# Endpoint of the S3 service to write to; for Localstack this would be http://localhost:4566
S3_ENDPOINT_URL = os.getenv('S3_ENDPOINT_URL')
options = {'client_kwargs': {'endpoint_url': S3_ENDPOINT_URL}}
input_file = 's3://nyc-duration/test_df.parquet'

df_input.to_parquet(
    input_file,
    engine='pyarrow',
    compression=None,
    index=False,
    storage_options=options
)
```
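
One possible variation, sketched below, builds the storage options only when `S3_ENDPOINT_URL` is set, so the same code could be pointed at Localstack or at real S3. The helper name `get_storage_options` is hypothetical and not part of the original script.

```python
import os


def get_storage_options():
    # Use the endpoint override only when S3_ENDPOINT_URL is set (e.g. for Localstack).
    endpoint_url = os.getenv('S3_ENDPOINT_URL')
    if endpoint_url:
        return {'client_kwargs': {'endpoint_url': endpoint_url}}
    # None is the pandas default for storage_options, so no override is applied.
    return None


print(get_storage_options())
```

The return value can be passed directly as `storage_options` to `to_parquet` or `read_parquet`.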

# Q6. Finish the integration test

What's the sum of predicted durations for the test dataframe?

Answer: `69.28`
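
As a quick arithmetic check, the two predictions that `integration_test.py` uses as expected values can be summed directly. The answer above appears to be that sum truncated to two decimals.

```python
import math

# Expected predictions taken from integration_test.py
predicted = [23.052085, 46.236612]

total = sum(predicted)
print(f'{total:.6f}')

# Truncate (not round) to two decimal places
truncated = math.floor(total * 100) / 100
print(truncated)
```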


**integration_test.py:**

```python
import os
import sys

# Add the current working directory to the import path so that `import batch` resolves
sys.path.insert(1, os.path.abspath('.'))

import batch
import pandas as pd
from datetime import datetime
from deepdiff import DeepDiff
from pprint import pprint


def dt(hour, minute, second=0):
    return datetime(2021, 1, 1, hour, minute, second)

# Run the batch job for January 2021 before reading its result file
os.system('python batch.py 2021 1')

actual_df = batch.read_data('test_df_{year:04d}-{month:02d}_result.parquet'.format(year=2021, month=1))
actual_dict = actual_df.to_dict(orient='records')

expected_dict = [{'ride_id':'2021/01_0', 'predicted_duration':23.052085},
                 {'ride_id':'2021/01_1', 'predicted_duration':46.236612}]

expected_df = pd.DataFrame(expected_dict)

diff = DeepDiff(actual_dict, expected_dict, significant_digits=1)


print('actual_df', actual_df)
print('\n')
print('expected_df', expected_df)
print('\n')
pprint(f'diff={diff}')

assert 'values_changed' not in diff
assert 'type_changes' not in diff

print('all good')
```