# Software Engineering & Data Science

This is a course about:

- Modularity
    - improve readability
    - improve maintainability
    - solve problems once
        - packages, classes and methods (in Python)
- Documentation 
    - show people how to use your project
    - prevent confusion among collaborators
    - prevent frustration for future readers
- Testing
    - save time over manual testing
    - find and fix bugs
    - run tests anywhere/anytime
- Git and version control

## Implementing PEP8



We can use `pycodestyle` to check our scripts for PEP8 violations.

In [1]:
!pip install pycodestyle



In [2]:
import pycodestyle

style_checker = pycodestyle.StyleGuide()

# run PEP8 check on multiple files
result = style_checker.check_files(["../datasets/nay_pep8.py", "../datasets/yay_pep8.py"])

../datasets/nay_pep8.py:1:1: E265 block comment should start with '# '
../datasets/nay_pep8.py:2:6: E225 missing whitespace around operator
../datasets/nay_pep8.py:4:2: E131 continuation line unaligned for hanging indent
../datasets/nay_pep8.py:5:6: E131 continuation line unaligned for hanging indent
../datasets/nay_pep8.py:6:1: E122 continuation line missing indentation or outdented
../datasets/nay_pep8.py:7:1: E265 block comment should start with '# '
../datasets/nay_pep8.py:8:1: E402 module level import not at top of file
../datasets/nay_pep8.py:9:1: E265 block comment should start with '# '
../datasets/nay_pep8.py:10:1: E302 expected 2 blank lines, found 0
../datasets/nay_pep8.py:10:18: E231 missing whitespace after ','
../datasets/nay_pep8.py:11:2: E111 indentation is not a multiple of four
../datasets/nay_pep8.py:12:2: E111 indentation is not a multiple of four
../datasets/nay_pep8.py:14:1: E265 block comment should start with '# '
../datasets/nay_pep8.py:15:1: E305 expected 2 bl
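
To make a few of these error codes concrete, here is a small sketch. It is not the content of `nay_pep8.py` (which is not shown here); the variable `total` and the function `add_numbers` are made up for illustration. The non-compliant form is kept in comments and the compliant form is the live code.

```python
# Non-compliant version (kept in comments so that this block still runs):
#
#   #a badly formatted comment    -> E265: block comment should start with '# '
#   total=3                       -> E225: missing whitespace around operator
#   def add_numbers(a,b):         -> E231: missing whitespace after ','
#    return a + b                 -> E111: indentation is not a multiple of four

# A well formatted comment
total = 3


def add_numbers(a, b):
    """Return the sum of a and b."""
    return a + b


print(add_numbers(total, 4))
# → 7
```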

# Writing a Python module

A minimal Python package consists of two elements: a directory and a Python file. The name of the directory is the name of the package. According to PEP8, package names should be short, all lowercase, and describe the package's functionality; here we use `package_name`. The file in the directory **must** be called `__init__.py`.

> As of Python 3.3, a directory without an `__init__.py` file can still be imported without error (as a namespace package), even though it doesn't follow this structure.

Our structure is:

```
work_dir
|-- my_script.py
|-- package_name
|---- __init__.py
```

In [3]:
import package_name

# shouldn't output anything useful since we haven't added help docs
help(package_name)

Help on package package_name:

NAME
    package_name

PACKAGE CONTENTS
    utils

FILE
    (built-in)




We can add other files to our package, changing the tree to

```
work_dir
|-- my_script.py
|-- package_name
|---- __init__.py
|---- utils.py
```

In this case, `utils.py` is a submodule and can be imported with

`import package_name.utils`

In [4]:
import package_name.utils

package_name.utils.we_need_to_talk(break_up=True)

It's not you, it's me...


We can also use the package's `__init__.py` file to make our utils functions more easily accessible.

In [5]:
# in __init__.py

# from .utils import we_need_to_talk
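
The effect of that import can be checked with a throwaway package. The sketch below builds one in a temporary directory; the package name `demo_package` and the body of `we_need_to_talk` (which returns its message here instead of printing it) are assumptions made for illustration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package on disk (all names here are hypothetical)
work_dir = Path(tempfile.mkdtemp())
package_dir = work_dir / "demo_package"
package_dir.mkdir()

# utils.py holds the actual function
(package_dir / "utils.py").write_text(
    "def we_need_to_talk(break_up=False):\n"
    "    return \"It's not you, it's me...\" if break_up else \"Hi!\"\n"
)

# __init__.py re-exports the function at the package level
(package_dir / "__init__.py").write_text(
    "from .utils import we_need_to_talk\n"
)

# Make the temporary directory importable, then import the package
sys.path.insert(0, str(work_dir))
importlib.invalidate_caches()
import demo_package

# The function is now reachable without spelling out the submodule
print(demo_package.we_need_to_talk(break_up=True))
# Both names point at the same function object
print(demo_package.we_need_to_talk is demo_package.utils.we_need_to_talk)
```

Without the line in `__init__.py`, only the longer `demo_package.utils.we_need_to_talk` path would work, and only after importing the submodule explicitly.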

We can continue extending the tree by adding more modules:

```
work_dir
|-- my_script.py
|-- package_name
|---- __init__.py
|---- utils.py
|---- datacamp.py
|---- analysis.py
```

We should only import the package's key functionality in the `__init__.py` file to make it directly and easily accessible.

We can also add subpackages to our package, making the tree look like so:

```
work_dir
|-- my_script.py
|-- package_name
|---- __init__.py
|---- utils.py
|---- datacamp.py
|---- analysis.py
|---- sub_package
|-------- __init__.py
|-------- functionality.py
```

## Making our package portable

The two main steps to sharing a package are creating:
- `setup.py`
- `requirements.txt`

These files sit at the top level of the working directory, next to the package directory:

```
work_dir
|-- my_script.py
|-- requirements.txt
|-- setup.py
|-- package_name
|---- __init__.py
|---- utils.py
|---- datacamp.py
|---- analysis.py
|---- sub_package
|-------- __init__.py
|-------- functionality.py
```

### Requirements 

- How to recreate the environment needed to properly use your package
    - list of python packages and (optionally) the version requirement for each package 

In [None]:
# needed packages/versions
# 3 ways of specifying the packages and versions
matplotlib
numpy==1.15.4
pycodestyle>=2.4.0

In [None]:
# this installs the environment specified in requirements.txt
!pip install -r requirements.txt
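
To illustrate how the three requirement styles above differ, here is a toy parser. It is only a sketch: the regular expression is an assumption that covers just the `==` and `>=` forms used in these notes, whereas real requirement files support more operators and options.

```python
import re

# The three requirement styles from the file above
requirement_lines = ["matplotlib", "numpy==1.15.4", "pycodestyle>=2.4.0"]

# Toy pattern: a package name, optionally followed by an operator and a version
pattern = re.compile(r"^([A-Za-z0-9_.-]+)\s*(==|>=)?\s*([0-9][0-9A-Za-z.]*)?$")

parsed = []
for line in requirement_lines:
    name, operator, version = pattern.match(line).groups()
    parsed.append((name, operator, version))

for name, operator, version in parsed:
    if operator is None:
        print(f"{name}: any version")
    elif operator == "==":
        print(f"{name}: exactly version {version}")
    else:
        print(f"{name}: version {version} or newer")
```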

### `setup.py`

- How to install our package
- Provide info to PyPI if we decide to publish

In [None]:
# Import needed function from setuptools
from setuptools import setup

# Create proper setup to be used by pip
setup(name='text_analyzer',
      version='0.0.1',
      description='Perform and visualize a text analysis.',
      author='Miguel Carvalho',
      # packages lists every directory that contains an __init__.py file
      packages=['text_analyzer'],
      # requirements for the package to work
      install_requires=['matplotlib>=3.0.0'])
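
To see which entries would belong in `packages` once a package has subpackages, the sketch below walks a directory tree and collects every directory containing an `__init__.py`. The helper `list_packages` is hypothetical and written only for illustration; in a real `setup.py`, `setuptools.find_packages()` serves this purpose.

```python
import os
import tempfile
from pathlib import Path

# Recreate a (hypothetical) package layout in a temporary directory
work_dir = Path(tempfile.mkdtemp())
for directory in ["package_name", "package_name/sub_package"]:
    (work_dir / directory).mkdir(parents=True)
    (work_dir / directory / "__init__.py").write_text("")


def list_packages(root):
    """Return dotted names of directories under root that contain __init__.py."""
    found = []
    for current, _dirs, files in os.walk(root):
        if "__init__.py" in files:
            relative = os.path.relpath(current, root)
            found.append(relative.replace(os.sep, "."))
    return sorted(found)


print(list_packages(work_dir))
# → ['package_name', 'package_name.sub_package']
```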

# Utilizing Classes

If we want to leverage modularity in Python, we will use *classes* (we won't go into Object-Oriented Programming fully here).

A minimal class definition looks like this:

In [10]:
# Class names use CamelCase (CapWords), never underscores
class MyClass:
    """A minimal class example

    :param value: value to set as the ``attribute`` attribute
    :ivar attribute: contains the contents of the ``value`` passed in init
    """

    # Method to initialize a new instance of MyClass
    def __init__(self, value):
        # Define attribute with the contents of the value param
        self.attribute = value
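
A short usage sketch of this class follows; the string passed as `value` is arbitrary and chosen only for illustration.

```python
class MyClass:
    """A minimal class example

    :param value: value to set as the ``attribute`` attribute
    :ivar attribute: contains the contents of the ``value`` passed in init
    """

    def __init__(self, value):
        self.attribute = value


# Calling the class runs __init__ and returns the new instance
my_instance = MyClass("class attribute")

# The value passed in is now stored on the instance
print(my_instance.attribute)
# → class attribute
```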

## Using a class in a package

In [2]:
import package_name

my_instance = package_name.MyClass()

AttributeError: module 'package_name' has no attribute 'MyClass'

The `AttributeError` tells us that `MyClass` is not available in the package's namespace: the class has to be defined in, or imported into, `package_name/__init__.py` before `package_name.MyClass` works. Note also that `MyClass.__init__` requires a `value` argument.

## `self`

- A way to refer to a class instance, even though we don't know what the user will name their instance
- When defining typical class instance methods, like `__init__`, `self` is the first argument
- However, when users create an instance, they do not need to pass `self` explicitly: this is done under the hood
- Once in the method body, we can use `self` to access or define attributes
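
The points above can be sketched with a small class. `Greeter` and its `greet` method are hypothetical names used only for this example.

```python
class Greeter:
    def __init__(self, name):
        # self is the instance being initialized, whatever the user calls it
        self.name = name

    def greet(self):
        # self gives access to the attributes defined in __init__
        return f"Hello, {self.name}!"


first = Greeter("Ada")
second = Greeter("Grace")

# Usual form: Python passes the instance as self under the hood
print(first.greet())
# → Hello, Ada!

# Equivalent explicit form: the instance is passed as self by hand
print(Greeter.greet(second))
# → Hello, Grace!
```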

## Methods in the class

- By convention, methods that are meant to be called only by other methods of the class, and not directly by the user, are prefixed with a single `_`
- This marks the method as non-public: it is not intended for direct use, even though calling it is technically possible

In [None]:
from collections import Counter

# NOTE: tokenize is assumed to be a helper defined elsewhere (not shown here)


class Document:
    def __init__(self, text):
        self.text = text
        # Tokenize the document with non-public tokenize method
        self.tokens = self._tokenize()
        # Perform word count with non-public count_words method
        self.word_counts = self._count_words()

    # non-public method to split the document's text into tokens
    def _tokenize(self):
        return tokenize(self.text)

    # non-public method to tally document's word counts with Counter
    def _count_words(self):
        return Counter(self.tokens)
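
For a version that runs on its own, the sketch below supplies a stand-in `tokenize`. Its regular expression is an assumption and not necessarily how the course helper works.

```python
import re
from collections import Counter


def tokenize(text):
    """Stand-in tokenizer (an assumption): lowercase words, hashtags and mentions."""
    return re.findall(r"[#@]?\w+", text.lower())


class Document:
    def __init__(self, text):
        self.text = text
        self.tokens = self._tokenize()
        self.word_counts = self._count_words()

    def _tokenize(self):
        return tokenize(self.text)

    def _count_words(self):
        return Counter(self.tokens)


doc = Document("the quick fox and the lazy dog")

# Users work with the public attributes that __init__ prepared
print(doc.word_counts.most_common(1))
# → [('the', 2)]

# The underscore is only a convention: the method can still be called
print(doc._tokenize()[:2])
# → ['the', 'quick']
```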

## Classes and the DRY principle



- Don't Repeat Yourself
    - Reuse code
    - Improve maintainability
- Inheritance
    - Start with a parent class and pass on its functionality to a child class
    - The child inherits all methods and attributes of the parent class and is able to extend its functionality
    
### How to use inheritance in Python?

1. Import the parent class
2. Pass the parent class as an argument in the child's `class` statement
3. Call the parent class's `__init__` method (recall: `__init__` initializes an instance of a class and accepts `self` as its first argument)
    1. With this call we run the parent's initialization on the child's `self`
    2. This means `self` now has all the attributes the parent's `__init__` would set, in addition to the inherited methods

In [None]:
class Document:
    # Initialize a new Document instance
    def __init__(self, text):
        self.text = text
        # Pre-tokenize the document with non-public tokenize method
        self.tokens = self._tokenize()
        # Pre-compute the word counts with non-public count_words method
        self.word_counts = self._count_words()

    def _tokenize(self):
        return tokenize(self.text)

    # Non-public method to tally document's word counts
    def _count_words(self):
        # Use collections.Counter to count the document's tokens
        return Counter(self.tokens)

In [None]:
# Define a SocialMedia class that is a child of the `Document` class
class SocialMedia(Document):
    def __init__(self, text):
        Document.__init__(self, text)
        self.hashtag_counts = self._count_hashtags()
        self.mention_counts = self._count_mentions()
        
    def _count_hashtags(self):
        # Filter attribute so only words starting with '#' remain
        return filter_word_counts(self.word_counts, first_char='#')      
    
    def _count_mentions(self):
        # Filter attribute so only words starting with '@' remain
        return filter_word_counts(self.word_counts, first_char='@')
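
The two classes above depend on helpers that are not shown in these notes. The sketch below is a compact, self-contained variant: `tokenize` and `filter_word_counts` are hypothetical stand-ins, and the sample post is made up.

```python
import re
from collections import Counter


def tokenize(text):
    """Stand-in tokenizer (an assumption): lowercase words, hashtags and mentions."""
    return re.findall(r"[#@]?\w+", text.lower())


def filter_word_counts(word_counts, first_char):
    """Hypothetical helper: keep only the words starting with first_char."""
    return Counter({word: count for word, count in word_counts.items()
                    if word.startswith(first_char)})


class Document:
    def __init__(self, text):
        self.text = text
        self.tokens = tokenize(text)
        self.word_counts = Counter(self.tokens)


class SocialMedia(Document):
    def __init__(self, text):
        # Let the parent set text, tokens and word_counts on self
        Document.__init__(self, text)
        # Extend the parent's functionality with two new attributes
        self.hashtag_counts = filter_word_counts(self.word_counts, "#")
        self.mention_counts = filter_word_counts(self.word_counts, "@")


post = SocialMedia("Loving #python and #pandas, thanks @DataCamp! #python")
print(post.hashtag_counts.most_common(1))
# → [('#python', 2)]
print(dict(post.mention_counts))
# → {'@datacamp': 1}
```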


## Multilevel Inheritance

If we want to continue the inheritance tree, we can. We can create a class which is a child of the `SocialMedia` class and inherits from it.

We can even have a class which inherits from multiple parents - i.e. multiple inheritance.
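
As a sketch of multiple inheritance, the hypothetical classes below (`Base`, `Left`, `Right`, `Both`) record the order in which their initializers run. With `super()`, Python follows the method resolution order (MRO), so each initializer runs exactly once.

```python
class Base:
    def __init__(self):
        self.log = ["Base"]


class Left(Base):
    def __init__(self):
        super().__init__()
        self.log.append("Left")


class Right(Base):
    def __init__(self):
        super().__init__()
        self.log.append("Right")


# Both inherits from two parents at once
class Both(Left, Right):
    def __init__(self):
        super().__init__()
        self.log.append("Both")


# The order in which Python looks up methods and attributes
print([cls.__name__ for cls in Both.__mro__])
# → ['Both', 'Left', 'Right', 'Base', 'object']

# Each initializer ran once, starting from the end of the MRO
print(Both().log)
# → ['Base', 'Right', 'Left', 'Both']
```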

### Two ways to call the parent's `__init__`: `super` and parent class

In [7]:
class Parent():
    def __init__(self):
        print("I'm a parent!")
        
class SuperChild(Parent):
    def __init__(self):
        # call the parent's init directly, passing self explicitly
        Parent.__init__(self)
        print("I'm a super child!")
        
class Grandchild(SuperChild):
    def __init__(self):
        # instead of calling directly the init of parent
        # we use the super function
        
        # no functional difference but better for maintainability
        super().__init__()
        print("I'm a grandchild!")

In [8]:
Grandchild()

I'm a parent!
I'm a super child!
I'm a grandchild!


<__main__.Grandchild at 0x10c02c950>

In [9]:
class Parent():
    def __init__(self):
        print("I'm a parent!")
        
class SuperChild(Parent):
    def __init__(self):
        # notice super() and not calling directly parent class
        super().__init__() 
        print("I'm a super child!")
        
class Grandchild(SuperChild):
    def __init__(self):
        # notice super() and not calling directly parent class
        super().__init__()
        print("I'm a grandchild!")

In [10]:
Grandchild()

I'm a parent!
I'm a super child!
I'm a grandchild!


<__main__.Grandchild at 0x110ae53d0>

If we need to list all the attributes and methods of a class, we can use `dir`:

In [12]:
# do not use in scripts
dir(Grandchild)

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__str__',
 '__subclasshook__',
 '__weakref__']
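
The `dir` output mixes inherited special attributes with everything else. As one possible way to narrow it down, the sketch below uses the standard `inspect` module to keep only functions defined in Python code; the classes `Parent` and `Child` and their methods are made up for the example.

```python
import inspect


class Parent:
    def __init__(self):
        self.kind = "parent"

    def greet(self):
        return "hello"


class Child(Parent):
    def play(self):
        return "playing"

    def _rest(self):
        return "resting"


# Functions available on Child, whether defined there or inherited
methods = [name for name, _ in inspect.getmembers(Child, inspect.isfunction)]
print(methods)
# → ['__init__', '_rest', 'greet', 'play']

# Drop the names that are non-public by convention
public = [name for name in methods if not name.startswith("_")]
print(public)
# → ['greet', 'play']
```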

# Documentation

Documentation can come in many forms:
- Comments
    - used inline to explain code
    - goal is to make code more readable for people
    - explain *why* a line of code is doing something instead of *what*
- Docstrings
    - Audience is end-users
    - It's what Python outputs when a user calls `help`
    
## Anatomy of a docstring

In [None]:
def function(x):
    """High level description of function
    
    Additional details on function
    
    :param x: description of parameter x
    :return: description of return value
    
    >>> # Example function usage
    Expected output of example function usage
    """

This syntax is used by convention, since downstream documentation tools (such as Sphinx) can parse it and convert docstrings into web-based docs.

In [13]:
def square(x):
    """Square the number x
    
    :param x: number to square
    :return: x squared
    
    >>> square(2)
    4
    """
    # 'x * x' is faster than 'x ** 2'
    # reference: https://stackoverflow.com/a/29055266/5731525
    return x * x

In [14]:
help(square)

Help on function square in module __main__:

square(x)
    Square the number x
    
    :param x: number to square
    :return: x squared
    
    >>> square(2)
    4
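
The `>>>` example in the docstring is more than documentation: the standard `doctest` module can run it and compare the result with the expected output. A minimal sketch, reusing `square` from above:

```python
import doctest


def square(x):
    """Square the number x

    :param x: number to square
    :return: x squared

    >>> square(2)
    4
    """
    return x * x


# Collect the examples embedded in the docstring and run them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(square, "square", globs={"square": square}):
    runner.run(test)

# summarize returns the number of failed and attempted examples
results = runner.summarize(verbose=False)
print(results.failed, results.attempted)
# → 0 1
```

If the docstring example drifts out of sync with the code, `results.failed` becomes non-zero, which keeps the documentation honest.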



## Readability counts

Some rules of thumb:
- Function names should describe what the function does without being overly verbose
- If a function cannot fit on the screen it's probably too big
- We should split different functionality across functions
- Functions should achieve only one thing
- If it's hard to think of a meaningful name for the function, then it's probably doing too much
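
As a sketch of these rules of thumb, the hypothetical functions below split parsing, computing and formatting into separate steps, so that each one does a single thing and is easy to name.

```python
def load_scores(raw):
    """Parse a comma-separated string of numbers into a list of floats."""
    return [float(item) for item in raw.split(",") if item.strip()]


def mean(values):
    """Return the arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def summarize(raw):
    """Combine the single-purpose helpers into a short report."""
    scores = load_scores(raw)
    return f"n={len(scores)}, mean={mean(scores):.2f}"


print(summarize("3, 4.5, 6"))
# → n=3, mean=4.50
```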