# Software Engineering

In part 2 of software engineering practices, you'll learn about the following practices and how they apply in data science.

* Testing
* Logging
* Code reviews

![code](images/code-robust.png)

## Testing

Testing your code is essential before deployment. It helps you catch errors and faulty conclusions before they make any major impact. Today, employers are looking for data scientists with the skills to properly prepare their code for an industry setting, which includes testing their code.

### Typical Errors in the Data Science Process
* Bad encoding
* Inappropriate features - the way that the features are used does not correspond to reality.
* Unexpected features - data that break the model assumptions

> These errors are more difficult to find, because we have to check the quality of the analysis in addition to the quality of the code. Therefore: **TESTING**

* Problems that could occur in data science aren’t always easily detectable; you might have values being encoded incorrectly, features being used inappropriately, or unexpected data breaking assumptions.
* To catch these errors, you have to check for the quality and accuracy of your analysis in addition to the quality of your code. Proper testing is necessary to avoid unexpected surprises and have confidence in your results.
* Test-driven development (TDD): A development process in which you write tests for tasks before you even write the code to implement those tasks.
* Unit test: A type of test that covers a “unit” of code—usually a single function—independently from the rest of the program.

### Unit tests
We want to test our functions in a way that is repeatable and automated. Ideally, we'd run a test program that runs all our unit tests and cleanly lets us know which ones failed and which ones succeeded.

#### Example: nearest.py

```python
def nearest_square(num):
    """Return the nearest perfect square that is less than or equal to num."""
    root = 0
    while (root + 1) ** 2 <= num:
        root += 1
    return root ** 2
```

**Interactive testing of nearest.py**

```python
from nearest import nearest_square
nearest_square(5)
nearest_square(-12)
nearest_square(9)
```

**Putting code into a file to make it repeatable:**
```python
from nearest import nearest_square
print("Nearest square <= 5:", nearest_square(5))
print("Nearest square <= -12:", nearest_square(-12))
print("Nearest square <= 9:", nearest_square(9))
print("Nearest square <= 23:", nearest_square(23))
```
**Improving the file:**
```python
from nearest import nearest_square
print("Nearest square <= 5: returned {}, correct answer is 4.".format(nearest_square(5)))
print("Nearest square <= -12: returned {}, correct answer is 0.".format(nearest_square(-12)))
print("Nearest square <= 9: returned {}, correct answer is 9.".format(nearest_square(9)))
print("Nearest square <= 23: returned {}, correct answer is 16.".format(nearest_square(23)))
```
**Using assert:**
```python
from nearest import nearest_square
nearest_5 = nearest_square(5)
print("Nearest square <= 5: returned {}, correct answer is 4.".format(nearest_5))
assert nearest_5 == 4
nearest_n12 = nearest_square(-12)
print("Nearest square <= -12: returned {}, correct answer is 0.".format(nearest_n12))
assert nearest_n12 == 0
nearest_9 = nearest_square(9)
print("Nearest square <= 9: returned {}, correct answer is 9.".format(nearest_9))
assert nearest_9 == 9
nearest_23 = nearest_square(23)
print("Nearest square <= 23: returned {}, correct answer is 16.".format(nearest_23))
assert nearest_23 == 16
```
#### Unit test advantages and disadvantages
The advantage of unit tests is that they are isolated from the rest of your program, and thus, no dependencies are involved. They don't require access to databases, APIs, or other external sources of information. However, passing unit tests isn’t always enough to prove that our program is working successfully. To show that all the parts of our program work with each other properly, communicating and transferring data between them correctly, we use **integration tests**. In this lesson, we'll focus on unit tests; however, when you start building larger programs, you will want to use **integration tests** as well.<br>

#### Unit testing tools
To install `pytest`, run `pip install -U pytest` in your terminal. You can see more information on getting started [here](https://docs.pytest.org/en/latest/getting-started.html).
* Create a test file whose name starts with `test_`.
* Define unit test functions whose names start with `test_` inside the test file.
* Enter `pytest` into your terminal in the directory of your test file and it detects these tests for you.

`test_` is the default; if you wish to change this, you can learn how with this `pytest` [Examples and Customizations link](https://docs.pytest.org/en/latest/example/index.html?highlight=customize).<br>

In the test output, periods represent successful unit tests and Fs represent failed unit tests. Since all you see is which test functions failed, it's wise to have only one `assert` statement per test. Otherwise, you won't know exactly how many tests failed or which tests failed.<br>

A failed `assert` statement stops only the test function that contains it; `pytest` still runs the remaining tests. A syntax error, however, stops the run, because the test file cannot be imported.
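
As a minimal sketch of the steps above, a test file for `nearest_square` could look like the following. The file name `test_nearest.py` and the test function names are assumptions chosen for illustration. The function is repeated inline only to keep the example self-contained; a real test file would import it with `from nearest import nearest_square`.

```python
# test_nearest.py (hypothetical file name; pytest collects files named test_*.py)


def nearest_square(num):
    """Return the nearest perfect square that is less than or equal to num."""
    root = 0
    while (root + 1) ** 2 <= num:
        root += 1
    return root ** 2


# One assert per test function, so a failure points at exactly one case.
def test_nearest_square_5():
    assert nearest_square(5) == 4


def test_nearest_square_n12():
    assert nearest_square(-12) == 0


def test_nearest_square_9():
    assert nearest_square(9) == 9


def test_nearest_square_23():
    assert nearest_square(23) == 16
```

Running `pytest` in the directory that contains this file should report four passing tests.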

### Integration Tests
Integration testing exercises two or more parts of an application at once, including the interactions between the parts, to determine if they function as intended. This type of testing identifies defects in the interfaces between disparate parts of a codebase as they invoke each other and pass data between themselves.


To learn more about integration testing and how integration tests relate to unit tests, see [Integration Testing](https://www.fullstackpython.com/integration-testing.html). That article contains other very useful links as well.
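
To make the distinction concrete, here is a small sketch of an integration test. The two functions, `parse_record` and `mean_quality`, are hypothetical and exist only for this illustration. Each could have its own unit tests, while the test below checks that the output of one is a valid input for the other.

```python
def parse_record(line):
    """Parse a 'name,quality' text line into a (name, float) tuple."""
    name, quality = line.strip().split(",")
    return name, float(quality)


def mean_quality(records):
    """Return the mean of the quality values in (name, quality) tuples."""
    values = [quality for _, quality in records]
    return sum(values) / len(values)


def test_parse_and_average_work_together():
    # Integration test: raw text flows through both functions, so the
    # test also covers the interface between them.
    lines = ["wine_a,5.0\n", "wine_b,7.0\n"]
    records = [parse_record(line) for line in lines]
    assert mean_quality(records) == 6.0
```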

### Test-driven development and data science
* Test-driven development: Writing tests before you write the code that’s being tested. Your test fails at first, and you know you’ve finished implementing a task when the test passes.
* Tests can check for different scenarios and edge cases before you even start to write your function. When you start implementing your function, you can run the tests to get immediate feedback on whether it works as you tweak it.
* When refactoring or adding to your code, tests help you rest assured that the rest of your code didn't break while you were making those changes. Tests also help ensure that your function's behavior is repeatable, regardless of external parameters such as hardware and time.
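
The test-first workflow described above can be sketched as follows. The function `bucket_label` is a hypothetical example, not part of the lesson code. The tests are written first and pin down an edge case (a value equal to the median) before any implementation exists.

```python
# Step 1: write the tests first. Calling them before bucket_label is
# defined fails with a NameError, which is the expected "failing test".
def test_value_above_median_is_high():
    assert bucket_label(7.0, median=5.0) == "high"


def test_value_below_median_is_low():
    assert bucket_label(3.0, median=5.0) == "low"


def test_value_equal_to_median_is_high():
    # Edge case decided up front: ties count as "high".
    assert bucket_label(5.0, median=5.0) == "high"


# Step 2: write just enough code to make the tests pass.
def bucket_label(value, median):
    """Return 'high' if value is at or above the median, else 'low'."""
    return "high" if value >= median else "low"
```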

> Uncle Bob's rules about TDD (Test Driven Development)
1. A test-driven developer does not write a single line of production code until they have written a failing unit test. No production code can be shipped until there is a failing unit test.
2. You do not write more of a unit test than is sufficient to fail. Not compiling is failing.
3. You do not write more production code than is sufficient to pass a failing test.

> Jim Coplien advises
1. Use TDD in a defined and consolidated architecture
2. Use TDD based on a defined problem and as a means to achieve the solution for it.




#### "Problems" in applying *DD to Data Science & Big Data

##### Use of Notebooks and Web GUIs

1. Notebooks encourage flat structure, resulting in disorganised code
2. They are powerful interactive tools, but not powerful code editors; code inspections, refactoring tools, etc. are weak
3. They encourage manual testing
4. They are hard to track in git (they are stored as JSON, not plain code, so diffs are noisy and hard to review)
5. Your output, your business value, has a dependency on a software environment

Always plan to deliver something that is independent of your own knowledge and toolchain.

##### Model Performance
We have a use case, and that use case can determine a few desirable score thresholds along with a minimum acceptable level of performance at each.

> Then you can write a test asserting that, at thresholds A, B and C, the model's precision (for example) is greater than the acceptable levels A', B' and C'.
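
A minimal sketch of such a test is shown below. The scores, labels, thresholds, and minimum precision values are all made-up illustration data, and `precision_at_threshold` is a hypothetical helper written in plain Python rather than a call into any particular library.

```python
def precision_at_threshold(scores, labels, threshold):
    """Precision of the predictions 'score >= threshold' against 0/1 labels."""
    predicted_positive = [label for score, label in zip(scores, labels)
                          if score >= threshold]
    if not predicted_positive:
        return 0.0  # assumption: no positive predictions counts as precision 0
    return sum(predicted_positive) / len(predicted_positive)


# Hypothetical model scores and true labels for a held-out set.
SCORES = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.10]
LABELS = [1, 1, 1, 0, 1, 0, 1, 0]

# Hypothetical thresholds (A, B, C) mapped to minimum precision (A', B', C').
MIN_PRECISION = {0.5: 0.75, 0.75: 0.9, 0.9: 0.95}


def test_precision_meets_minimum_at_each_threshold():
    for threshold, minimum in MIN_PRECISION.items():
        precision = precision_at_threshold(SCORES, LABELS, threshold)
        assert precision >= minimum, (threshold, precision, minimum)
```

In a real project the scores would come from the trained model and a fixed evaluation set, so the test fails when a retrained model drops below the agreed levels.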

##### Speed Tests and Slow Jobs
We have some "acceptable limit" on how long the jobs should take. Ideally have a good **CI pipeline** that automatically runs nightly tests on your master branch on a realistic cluster.<br>

Another trick that can work well for jobs known to scale down (i.e., they run on fewer nodes, just more slowly) is to make your dev cluster much larger than your prod cluster so you can speed up your development cycle.
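
The "acceptable limit" idea can be sketched as an ordinary test that times a job. Everything here is an assumption for illustration: `run_job` is a stand-in for a real job, and `MAX_SECONDS` is an arbitrary limit. On a real cluster the measurement would be noisier, so the limit would need some headroom.

```python
import time

MAX_SECONDS = 2.0  # hypothetical "acceptable limit" for this job


def run_job(n_rows):
    """Stand-in for a real job: sums the squares of n_rows integers."""
    return sum(i * i for i in range(n_rows))


def test_job_finishes_within_limit():
    start = time.perf_counter()
    run_job(100_000)
    elapsed = time.perf_counter() - start
    assert elapsed < MAX_SECONDS, f"job took {elapsed:.2f}s"
```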

##### Resource Problems
This is again a real problem in Big Data. It's hard to write a test for out-of-memory or disk-space errors. The solution is similar to the one above: set up a good **CI pipeline** to automate running your jobs *every night* and *before release*.

> "you don't go fast by rushing you go fast by being deliberate" Uncle Bob

#### More on the topic: Testing

Test-driven development (TDD) for data science is relatively new and is experiencing a lot of experimentation and breakthroughs. You can learn more about it by exploring the following resources.

* [Data Science TDD](https://www.linkedin.com/pulse/data-science-test-driven-development-sam-savage/): Blog post about how to apply the principles of TDD to data science.
* [TDD is Essential for Good Data Science Here's Why](https://medium.com/@karijdempsey/test-driven-development-is-essential-for-good-data-science-heres-why-db7975a03a44): Explanation of why TDD is important to improve data science.
* [Testing Your Code](https://docs.python-guide.org/writing/tests/): Some general rules and practices for using TDD in Python.





## Logging
Logging is the process of recording messages to describe events that have occurred while running your software. Let's take a look at a few examples, and learn tips for writing good log messages.

### Tips

#### Be professional and clear

```
Bad: Hmmm... this isn't working???
Bad: idk.... :(
Good: Couldn't parse file.
```
#### Be concise and use normal capitalization
```
Bad: Start Product Recommendation Process
Bad: We have completed the steps necessary and will now proceed with the recommendation process for the records in our product database.
Good: Generating product recommendations.
```
#### Choose the appropriate level for logging

* **Debug**: Use this level for anything that happens in the program.
* **Error**: Use this level to record any error that occurs.
* **Info**: Use this level to record all actions that are user-driven or system-specific, such as regularly scheduled operations.

#### Provide any useful information
```
Bad: Failed to read location data
Good: Failed to read location data: store_id 8324971
```
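
As a small sketch of this tip in Python, the identifier can be passed as an argument to the log call. The logger name `locations` and the function `read_location` are hypothetical names used only for this illustration.

```python
import logging

logger = logging.getLogger("locations")  # hypothetical logger name


def read_location(store_id, locations):
    """Return the location for store_id, logging the id if the lookup fails."""
    try:
        return locations[store_id]
    except KeyError:
        # Include the identifier so the message is actionable on its own.
        logger.error("Failed to read location data: store_id %s", store_id)
        return None
```

Passing `store_id` as an argument instead of concatenating strings lets the `logging` module build the message only when it is actually emitted.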

### Logging in Python
It is easy to add logging to Python code. The first step is importing the logging module at the top of your code.
```python
import logging
```
After that, you can add messages to a logger.
```python
logging.debug('This is a debug message') # Anything that happens in the program
logging.info('This is an info message') # To record all actions that are user-driven or system specific such as regularly scheduled operations
logging.error('This is an error message') # To record any error that occurs
```
Notice that the statement for a log message is `logging.LEVEL` where *LEVEL* corresponds to the level of the log message *(debug, info, error)*, and then the message itself is passed as a parameter. There are additional levels of logging, but they are beyond the scope of this lesson.<br>

Of the three levels discussed, only the error messages will be logged by default so the output of these statements is

```
ERROR:root:This is an error message
```
The "root" in the output indicates that the default logger is being used. In order to log info and debug messages, the default level needs to be changed.
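
One way to change the default level is `logging.basicConfig`, sketched below. The `force=True` argument is an optional choice that assumes Python 3.8 or later; it replaces any handlers configured earlier in the same process, which is useful in notebooks.

```python
import logging

# Lower the threshold of the root logger so that debug and info
# messages are emitted as well as errors.
logging.basicConfig(level=logging.DEBUG, force=True)

logging.debug('This is a debug message')
logging.info('This is an info message')
logging.error('This is an error message')
```

With this configuration all three messages should appear, each in the same `LEVEL:root:message` format as the error line shown earlier.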

### Recap

![logging](images/logging.PNG)

![logging_level](images/logging_level.PNG)

### Additional Information
For more information on logging in Python, check out this [blog post](https://realpython.com/python-logging/).


### Example

    import logging

    import pandas as pd

    df = pd.read_csv('winequality-red.csv', sep=';')
    logging.debug('able to read file: winequality-red.csv')
    df.columns = [label.replace(' ', '_') for label in df.columns]
    logging.debug("updated column headers by replacing ' ' with '_'")


    def numeric_to_buckets(df, column_name):
        median = df[column_name].median()
        # Pass values as arguments; concatenating a number to a str raises TypeError.
        logging.info('computed median: %s', median)
        for i, val in enumerate(df[column_name]):
            if val >= median:
                logging.debug('found val >= median for val: %s', val)
                df.loc[i, column_name] = 'high'
            else:
                df.loc[i, column_name] = 'low'


    for feature in df.columns[:-1]:
        numeric_to_buckets(df, feature)
        print(df.groupby(feature).quality.mean(), '\n')
    logging.info('done bucketing values')

## Code Reviews
Code reviews benefit everyone on a team: they promote good programming practices and help prepare code for production.

They are especially helpful in:

* Catching errors
* Ensuring readability of your code
* Checking that standards are being met
* Sharing knowledge within your team

Let's go over what to look for in a code review and some tips on how to conduct one.

* [Code reviews](https://github.com/lyst/MakingLyst/tree/master/code-reviews): Check out the guidelines for conducting code reviews at Lyst.
* [Code review best practices](https://www.kevinlondon.com/2015/05/05/code-review-best-practices.html): This is a blog post about general best practices, what to look for, and how to approach code reviews.

### Reviewing code
#### Know what you need to review
Because getting PRs merged quickly is so important, you need to stay on top of the changes people want your input on. You can use tools like Trailer.app or even simple GitHub searches using "involves:<YOUR USERNAME>" to find PRs relevant to you.

#### Respond quickly
Tolerate being interrupt-driven. You need focussed time to get your other work done, but PRs are time-sensitive because they block other people. You shouldn't put off code review for more than a few hours, and never for more than a day.

#### Prioritise code review highly
It's one of the most important things you can work on. The only things that should come ahead of it are time critical work such as fixing availability problems or dealing with high priority requests from customers. This means you should expect to spend some time every day doing reviews and that you will probably need to spend 2-3 sessions per day replying to PRs and reading other people's code.

#### Be thorough
Reviewing code is hard and error-prone. It is our last line of defence against downtime and tech debt. You must pay attention, ask questions and not +1 lightly. Sometimes you will slip and miss something or only notice it late in the review process. This is not great, but it is forgiven. Own up to it and move on. Confused about a bit of code? Ask what it does - *there are no stupid questions*.

#### Don't block progress
While code reviews need to be done thoughtfully and thoroughly, we also need to avoid blocking progress. This means you should help people with solutions, not only identify problems. Often it might be faster to go and pair with someone for a bit to get their code tidied up instead of going back and forth on GitHub about it. Be aware that some test suites take some time to run, and it may make more sense for some fixes to be made in subsequent PRs instead of being added to this one.

### Things to look for
#### Clarity
Things should be named well and should be easy to follow when reading. The code should attempt to be self-documenting.

#### Correctness
There must be unit tests. They should test the edge cases. The code should behave as the submitter described. The code should use other APIs correctly.

#### Security
The design should not introduce any security problems such as potential denial of service attacks or unintended information disclosures. In particular we should be aware of potential CSRF and XSS attacks.

#### Performance
The code should perform within our targets for a particular area. It should not use obviously suboptimal algorithms. However, optimisation is usually best left for later, except when it can also improve other areas at the same time. Simpler code is often faster.
<br>

### Questions to ask yourself when conducting a code review
Let's look over some of the questions we might ask ourselves while reviewing code. These are drawn from the concepts we've covered in these last two lessons.

#### Is the code clean and modular?
* Can I understand the code easily?
* Does it use meaningful names and whitespace?
* Is there duplicated code? (Three strikes rule)
* Can I provide another layer of abstraction?
* Is each function and module necessary?
* Is each function or module too long?

#### Is the code efficient?
* Are there loops or other steps I can vectorize?
* Can I use better data structures to optimize any steps?
* Can I shorten the number of calculations needed for any steps?
* Can I use generators or multiprocessing to optimize any steps?
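
As one illustration of the data-structure question above, the sketch below replaces repeated list membership checks with a set lookup. The function names and the ID data they operate on are hypothetical, chosen only to show the pattern a reviewer might suggest.

```python
def common_ids_loop(recent_ids, all_ids):
    """Baseline: each 'in' check scans the all_ids list from the start."""
    return [i for i in recent_ids if i in all_ids]


def common_ids_set(recent_ids, all_ids):
    """Same result, but membership checks against a set are O(1) on average."""
    all_ids_set = set(all_ids)
    return [i for i in recent_ids if i in all_ids_set]
```

Both functions return the same list in the same order; the second avoids rescanning `all_ids` for every element of `recent_ids`.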

#### Is the documentation effective?
* Are inline comments concise and meaningful?
* Is there complex code that's missing documentation?
* Do functions use effective docstrings?
* Is the necessary project documentation provided?

#### Is the code well tested?
* Does the code have test coverage?
* Do tests check for interesting cases?
* Are the tests readable?
* Can the tests be made more efficient?

#### Is the logging effective?
* Are log messages clear, concise, and professional?
* Do they include all relevant and useful information?
* Do they use the appropriate logging level?

### Tips for conducting a code review
The goal of code review isn't to make all code follow your personal preferences, but to ensure it meets a standard of quality for the whole team.

#### Use a code linter
This isn't really a tip for code review, but it can save you lots of time in a code review. Using a Python code linter like [pylint](https://www.pylint.org/) can automatically check for coding standards and PEP 8 guidelines for you. It's also a good idea to agree on a style guide as a team to handle disagreements on code style, whether that's an existing style guide or one you create together incrementally as a team.

#### Explain issues and make suggestions
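Rather than commanding authors to change their code in a specific way, explain the consequences of the current code and *suggest* changes to improve it. Authors are much more receptive to feedback when they understand your thought process and are accepting recommendations rather than following commands.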
```
BAD: Make model evaluation code its own module - too repetitive.

BETTER: Make the model evaluation code its own module. This will simplify models.py to be less repetitive and focus primarily on building models.

GOOD: How about we consider making the model evaluation code its own module? This would simplify models.py to only include code for building models. Organizing these evaluations methods into separate functions would also allow us to reuse them with different models without repeating code.
```
#### Keep your comments objective
Try to avoid using the words "I" and "you" in your comments. Comments that sound personal draw attention to the people involved, while the review should keep its focus on the code.
```
BAD: I wouldn't groupby genre twice like you did here... Just compute it once and use that for your aggregations.

BAD: You create this groupby dataframe twice here. Just compute it once, save it as groupby_genre and then use that to get your average prices and views.

GOOD: Can we group by genre at the beginning of the function and then save that as a groupby object? We could then reference that object to get the average prices and views without computing groupby twice.
```
#### Provide code examples
When providing a code review, you can save the author time and make it easy for them to act on your feedback by writing out your code suggestions. It can also be much quicker for you to demonstrate concepts through code rather than explanations.
<br>
Let's say you were reviewing code that included the following lines:
```python
first_names = []
last_names = []

for name in df.name:
    first, last = name.split(' ')
    first_names.append(first)
    last_names.append(last)

df['first_name'] = first_names
df['last_name'] = last_names
```
```
BAD: You can do this all in one step by using the pandas str.split method.

GOOD: We can actually simplify this step to the line below using the pandas str.split method. Found this in this Stack Overflow post: https://stackoverflow.com/questions/14745022/how-to-split-a-column-into-two-columns
```
```python
# expand=True returns one column per piece; n=1 splits on the first space only
df[['first_name', 'last_name']] = df['name'].str.split(' ', n=1, expand=True)
```

