# DATACAMP - Software Engineering For Data Scientists In Python

- Coding should not just be a means to an end
    
Key Concepts covered:
- Modularity 
- Documentation
- Testing (automated)

### Modularity is:
- shorter, functional units 
- improves readability 
- easier to maintain 
- transferability between projects
- can create modular code in python by leveraging packages, classes & methods

__PyPI (Python Package Index)__ 
- Repository of software for Python - lets you leverage packages that others have published

__Pip (Pip Installs Packages)__
- can use within IDE OR 
- command line (anaconda prompt):
    $ pip install pycodestyle
    
__PEP 8__ 
- PEP = Python Enhancement Proposal; PEP 8 is the proposal that covers code style
- de facto style guide for Python
- use an IDE that flags violations as soon as a bad line of code is written OR use the pycodestyle package

__Pycodestyle__ 
- checks your code against PEP8 code convention
- can check in multiple files at once
- can run pycodestyle from the shell e.g. $ pycodestyle dummy_script.py
- output is description of pep8 violations + location of issue
- output is truncated & does not use 0-based indexing (line and column numbers start at 1)

__Help()__ 
- can call help on any object in python
- help(42)
- help(numpy)
- help(numpy.busday_count) 
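
help() works on your own code too: it prints the signature plus whatever docstring you wrote. A minimal sketch, where count_words is a made-up function used only for illustration:

```python
def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())


# Prints the function signature followed by the docstring above
help(count_words)

# The raw docstring that help() draws on
print(count_words.__doc__)
```

This is one reason to write docstrings early: users of your package get them through help() at no extra cost.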



# IDE (Integrated Development Environment) vs Text editor 

IDE Examples : Jupyter, PyCharm, Spyder (3 most popular Python IDEs in 2018)
    - N.B. Jupyter stands for Julia-Python-R (i.e. multi-language IDE - allows HTML, LaTeX etc. too)
    - Jupyter is a web application with a server-client structure --> allows it to integrate web libraries for dataviz, including plotly.js
    - To use Jupyter, start the server in a terminal and leave it running, otherwise you lose access to your notebooks. The Notebook program creates a local "web server" on your computer that you then connect to --> localhost:8888
        
Text Editor Examples : Vim, Notepad++

IDEs  VS. Text Editors
- You can code in almost any software, from command shell to Windows notepad
- BUT an IDE also lets you execute code & debug errors 
- Despite this, text editors v popular - many have add-ons (e.g. DBGP for Notepad++ lets you debug and execute your code)
- Whereas IDE features are already built in and don't require installation (can mean shallower learning curve)
- Even though the IDEs come with all these features already installed, installing a text editor is far easier, and they are lightweight compared to the IDEs

IS THERE A BEST IDE? 
- v subjective & depends on coding level
- Jupyter is more than just IDE - an education tool, for presentations, and even for writing blogs
- Jupyter can export your notebook from .ipynb format to PDF and HTML files, or you can just export it as a .py file



In [None]:
import pycodestyle

# Create a StyleGuide instance
style_checker = pycodestyle.StyleGuide()

# Run PEP 8 check on multiple files 
result = style_checker.check_files(['nay_pep8.py', 'yay_pep8.py'])
print(result.messages)

# Writing your first package 

A minimal python package consists of 2 elements: 

1. Directory (name of directory = name of the package)
    - Should be short, lower case name
    - Should describe function of package e.g. recoding_survey_data

2. Python file (named __init__.py)
    - A blank file that lets python know that this directory (e.g recoding_survey_data) is a package
    - N.B. you can import some functions in the __init__.py file

working_dir
├── text_analyzer
│    ├── __init__.py
│    └── counter_utils.py
└── my_script.py

- Can thus import this package just like numpy into our current script (my_script.py):

import text_analyzer as ta

- Can also add a single file i.e. a submodule within this directory e.g. working_dir/text_analyzer/counter_utils.py
- This utils file could contain some UDFs (user-defined functions) e.g. def binarise_responses() 
- To import the submodule, use its dotted name without the .py extension:

import text_analyzer.counter_utils

- If working in working_dir/my_script.py  :

import text_analyzer.counter_utils
text_analyzer.counter_utils.binarise_responses(is_na=False)


### Relative import syntax (an alternative to the above) 

- Could use  __init__.py to import functionality (if working in working_dir/text_analyzer/__init__.py )

from .counter_utils import binarise_responses, sum_counters 



- What if multiple submodules? Should you import all of these?
    -  No...should include key functions in __init__.py file 

# Import local package
import text_analyzer
# Sum word_counts using sum_counters from text_analyzer
word_count_totals = text_analyzer.sum_counters(word_counts)
# Plot word_count_totals using plot_counter from text_analyzer
text_analyzer.plot_counter(word_count_totals)
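
To see the package mechanics end to end, the sketch below builds a throwaway package in a temporary directory and imports it. The package name demo_text_pkg and the temporary location are assumptions made for illustration only; the point is that __init__.py uses a relative import to expose a function that lives in a submodule.

```python
import importlib
import sys
import tempfile
from collections import Counter
from pathlib import Path

# Build a throwaway package on disk (demo_text_pkg is a hypothetical name)
work_dir = Path(tempfile.mkdtemp())
pkg_dir = work_dir / "demo_text_pkg"
pkg_dir.mkdir()

# Submodule that holds the actual function
(pkg_dir / "counter_utils.py").write_text(
    "from collections import Counter\n"
    "\n"
    "def sum_counters(counters):\n"
    "    return sum(counters, Counter())\n"
)

# __init__.py re-exports the key function using relative import syntax
(pkg_dir / "__init__.py").write_text(
    "from .counter_utils import sum_counters\n"
)

# Make the working directory importable, then import the package by name
sys.path.insert(0, str(work_dir))
importlib.invalidate_caches()
import demo_text_pkg

totals = demo_text_pkg.sum_counters([Counter(a=1, b=2), Counter(a=3)])
print(totals)
```

Because __init__.py does the relative import, the caller writes demo_text_pkg.sum_counters(...) and never needs to know which submodule the function lives in.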

# Sharing a python package (i.e. package portability) 

- This requires: 
    1. setup.py 
    - Enables package installation by telling pip how to install it
    - This info will be used by PyPI if you decide to publish your software
    2. requirements.txt  
    - how to recreate the environment needed to properly use your package
    - includes a list of python packages + version requirements for each package

- These are located at the same level as our package directory:

working_dir
├── text_analyzer
│    ├── __init__.py
│    └── counter_utils.py
├── setup.py
├── requirements.txt
└── my_script.py

- Leverage the requirements.txt using pip install in our shell (presume anaconda) 

$ pip install -r requirements.txt         (installs everything listed in requirements.txt file) 

- Make sure you are in the directory that contains setup.py (i.e. working_dir) before pip installing the package: 

$ pip install .                            (will install package at an environment level)

In [5]:
# N.B. these are two separate files. requirements.txt is not Python, so
# running both in a single notebook cell raises a SyntaxError.

##### Contents of requirements.txt ####

# Needed packages are: 
matplotlib
numpy==1.15.4
pycodestyle>=2.4.0  


#### Contents of setup.py #### 
from setuptools import setup

setup(name='text_analyzer',
      version='0.0.1',
      description='Perform and visualize a text analysis.',
      author='stuart',
      packages=['text_analyzer'])

### Object Oriented Programming (OOP)

- Recommended way to write modular code & reap the benefits of modularity
- Produces code that is easy to understand + extensible
- Involves using classes to strengthen your package's functionality 
- __init__ is the method Python calls whenever a user creates an instance of a class 
- if we want specific attributes to exist as soon as the instance is created, we define them inside __init__ (which can also call other methods)
- PEP 8 recommends that class names are written in CamelCase (CapWords)

Class = the blueprint for creating an instance of that class – to avoid having to create standardised characteristics each time e.g. human
Instance = instantiated class with its own unique attributes/methods e.g. Amy/ Stuart  
E.g.2 iphone blueprint = class, iphone9/ iphoneX  = instance 


Why use classes? 
- Logically group our data and functions in a way that’s easy to reuse and also easy to build upon.
- A class bundles attributes (the data associated with the class) and methods (the functions associated with the class) in one place 


In [None]:
class MyClass:
    """
    Some documentation here which will appear when a user calls help(MyClass)
    """
    
    # A method to create a new instance of MyClass
    def __init__(self, value):
        # define attribute with the contents of the value param
        self.attribute = value

In [19]:
class Employee:
    
    ''' self is the instance so will get passed through the init method automatically '''
    def __init__(self, first, last, pay):
        self.first = first 
        self.last = last 
        self.pay = pay
        self.email = first + '.' + last + '@sky.uk'
        
    def full_name(self):
        return '{} {}'.format(self.first, self.last)
    
    def apply_raise(self, raise_amount):
        self.pay = int(self.pay * raise_amount)
        
employee1 = Employee('Coree', 'Schafer', 50000)
employee2 = Employee('Brett', 'Owen', 60000)

employee1.first = 'Corey'

print(employee1.email)

print(employee1.full_name())   # called on the instance, so self is passed automatically 
# Alternatively, call the method on the class and pass the instance explicitly: 
Employee.full_name(employee1)         

employee1.apply_raise(1.07)
print(employee1.pay)

Coree.Schafer@sky.uk
Corey Schafer
53500


- To make our class easily accessible, add it to our __init__.py file 
- We use relative import syntax to import MyClass from my_class.py file in the same directory 

What is 'self' ? 
- A way to refer to a class instance even though we don't know what the user is going to name the instance 
- when defining instance methods like __init__, 'self' is the first argument 
- when creating an instance, the user doesn't need to pass a value for self - Python does this automatically behind the scenes 
- once in the method body, we can use 'self' to access or define attributes  

In [None]:
# working in work_dir/my_package/__init__.py
from .my_class import MyClass
# working in work_dir/my_script.py
import my_package
my_instance = my_package.MyClass(value='class attribute value')

# print out class attribute value 
print(my_instance.attribute)

In [None]:
# EXAMPLE - within your text_analyzer package folder, have a document.py file with the following content:

# Define Document class
class Document:
    """A class for text analysis
    
    :param text: string of text to be analyzed
    :ivar text: string of text to be analyzed; set by `text` parameter
    """
    # Method to create a new instance of Document
    def __init__(self, text):
        # Store text parameter to the text attribute
        self.text = text

In [None]:
# To then import this using relative import syntax
from .document import Document

In [None]:
# Import custom text_analyzer package
import text_analyzer

# Create an instance of Document with datacamp_tweet
my_document = text_analyzer.Document(text=datacamp_tweet)

# Print the text attribute of the Document instance
print(my_document.text)

- But our Document class is not very useful --> just a container for the user provided text which doesn't add much value for the user
- Should add more attributes & methods beside __init__ which adds useful functionality automatically for the user, without them having to think about it E.G. LinearRegression Class - lm.coef_ 
    - E.G. tokenise text as soon as instance created - saves time if inevitable step! 

In [2]:
from collections import Counter     # needed for the word counts
from .token_utils import tokenize   # needed for tokenisation of words 

class Document: 
    """A class for text analysis

    :param text: string of text to be analyzed
    :param token_regex: regular expression used to split text into tokens
    """
    def __init__(self, text, token_regex=r'[a-zA-Z]+'):
        self.text = text
        self.token_regex = token_regex
        # tokenize with non-public tokenize method
        self.tokens = self._tokenize() 
        # Perform word count with non-public count_words method
        self.word_counts = self._count_words()
        
    def _tokenize(self):
        return tokenize(self.text, regex=self.token_regex)
    
    # word frequency - bag-of-words
    def _count_words(self):
        return Counter(self.tokens)
    
#### Non-Public Methods 
# "Non-public" in the sense that the user never needs to call these methods directly 
    # they just use the resulting attributes - in this case .tokens (from _tokenize) and .word_counts (from _count_words)
# They have a leading underscore, as in _tokenize & _count_words - a PEP 8 naming convention (not enforced by Python)
# User doesn't need to explicitly call tokenize() themselves 
# Signals to user that this method is only to be used inside the package i.e. internally only

# To then use this class: 

# create a new document instance from datacamp_tweets
datacamp_doc = Document(datacamp_tweets)

# print the first 5 tokens from datacamp_doc
print(datacamp_doc.tokens[:5])

# print the top 5 most used words in datacamp_doc
print(datacamp_doc.word_counts.most_common(5))
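
The class above depends on the package's token_utils module and on the course's datacamp_tweets data. For a version that runs on its own, here is a minimal sketch in which re.findall stands in for the tokenize helper and the sample sentence is invented:

```python
import re
from collections import Counter


class Document:
    """Toy version of the Document class, with tokenisation done inline."""

    def __init__(self, text, token_regex=r"[a-zA-Z]+"):
        self.text = text
        self.token_regex = token_regex
        # Both attributes are filled in automatically at creation time
        self.tokens = self._tokenize()
        self.word_counts = self._count_words()

    def _tokenize(self):
        # Non-public helper: re.findall stands in for token_utils.tokenize
        return re.findall(self.token_regex, self.text)

    def _count_words(self):
        # Non-public helper: bag-of-words frequency count
        return Counter(self.tokens)


doc = Document("the rain in spain stays mainly in the plain")
print(doc.tokens[:3])
print(doc.word_counts["in"])
```

The user only touches doc.tokens and doc.word_counts; the underscore-prefixed methods do the work behind the scenes.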


# Now what if we wanted to specialise this functionality to focus on social media text:

text_analyzer package --> Document class --> SocialMedia Class

- Document class should be preserved to be used for more general analysis
- Possible solution? Copy + paste Document class --> VIOLATES THE DRY PRINCIPLE (a concept in software engineering)

### DRY Principle = Don't Repeat Yourself
- Ensures you write modular, readable code 
- Involves writing reusable functions, classes and packages 
- If you find a bug, you only need to fix it in one place rather than in every copy --> stay DRY and you fix the bug once
- Leverage the OOP concept called 'INHERITANCE' to make a child class

### Inheritance  
- You start with a parent class and pass on its functionality to a child class
- The child class inherits all the methods and attributes of its parent 
- Able to add additional functionality without affecting the parent class 
- The SocialMedia class should live in the same package directory as the rest of text_analyzer's code


In [None]:
# First, import ParentClass:
from .parent_class import ParentClass

# Create a child class with inheritance 
class ChildClass(ParentClass):
    def __init__(self):
        # call parent's init method:
        ParentClass.__init__(self)    # self now has all the attributes & methods that an instance of ParentClass would have 
        # create an attribute that only the child has (placeholder value)
        self.fake_attribute = 'I am a child-only attribute'
        
# To use this and create an instance of our new child class: 
child_instance = ChildClass()
print(child_instance.fake_attribute)
print(child_instance.parent_attribute)   # i.e. can access attributes of the parent as well as the child class 

In [None]:
# Define a SocialMedia class that is a child of the `Document class`
class SocialMedia(Document):
    def __init__(self, text):
        Document.__init__(self, text)
        self.hashtag_counts = self._count_hashtags()
        self.mention_counts = self._count_mentions()
        
    def _count_hashtags(self):
        # Filter attribute so only words starting with '#' remain
        return filter_word_counts(self.word_counts, first_char='#')
    
    def _count_mentions(self):
        # Filter attribute so only words starting with '@' remain
        return filter_word_counts(self.word_counts, first_char='@')
    
    
####### Now, to use this ############

# Import custom text_analyzer package
import text_analyzer

# Create a SocialMedia instance with datacamp_tweets
dc_tweets = text_analyzer.SocialMedia(text=datacamp_tweets)  # instantiate SocialMedia Class

# Print the top five most mentioned users
print(dc_tweets.mention_counts.most_common(5))   # attribute of instance

# Plot the most used hashtags
text_analyzer.plot_counter(dc_tweets.hashtag_counts)
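
The SocialMedia example relies on helpers from the course package. The sketch below is a self-contained approximation: filter_word_counts is a hypothetical stand-in written for this example, and the token regex is an assumption chosen so that '#' and '@' prefixes survive tokenisation.

```python
import re
from collections import Counter


class Document:
    """Simplified parent class: tokenises text and counts the tokens."""

    def __init__(self, text, token_regex=r"[#@]?\w+"):
        self.text = text
        self.tokens = re.findall(token_regex, text)
        self.word_counts = Counter(self.tokens)


def filter_word_counts(word_counts, first_char):
    """Hypothetical helper: keep only words that start with first_char."""
    return Counter({word: count for word, count in word_counts.items()
                    if word.startswith(first_char)})


class SocialMedia(Document):
    """Child class: reuses Document's setup and adds two attributes."""

    def __init__(self, text):
        # Run the parent's __init__ so tokens and word_counts exist
        Document.__init__(self, text)
        self.hashtag_counts = filter_word_counts(self.word_counts, "#")
        self.mention_counts = filter_word_counts(self.word_counts, "@")


post = SocialMedia("learning #python & #rstats is awesome! thanks @datacamp!")
print(post.hashtag_counts)
print(post.mention_counts)
print(isinstance(post, Document))
```

Nothing in Document had to change to support the new behaviour, which is the DRY benefit of inheritance.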

## Multi-level inheritance (adding a grandchild class)

- Useful if we want even more specific functionality : Document --> SocialMedia --> Tweets (created within a Tweets.py file)
- Inheritance allows us to pass on all traits from all prior classes in the family tree (parent-->child-->grandchild)
- Multiple classes can inherit from the same parent
- In fact, one child class can inherit from multiple parents --> called 'Multiple Inheritance' (an advanced OOP concept, not covered in this tutorial)


## Can be difficult to remember the attributes and methods of inherited classes

- If using IDE, can use tab complete to see all associated methods/classes 
- But if want to use console:
     - help() only includes public methods in its output 
     - dir() provides fairly exhaustive list of what your class has under the covers 
     
N.B. dir() is not advised for use in actual .py script 
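
As a quick console illustration of dir(), the sketch below lists everything an instance of a child class carries and then filters out the underscore-prefixed names. The Parent and Child classes here are invented for the example.

```python
class Parent:
    def __init__(self):
        self.parent_attribute = "set by Parent"

    def describe(self):
        return "public method"

    def _helper(self):
        return "non-public method"


class Child(Parent):
    def __init__(self):
        Parent.__init__(self)
        self.child_attribute = "set by Child"


child = Child()

# dir() returns a sorted list of names, including inherited and non-public ones
all_names = dir(child)

# Keep only the names without a leading underscore
public_names = [name for name in all_names if not name.startswith("_")]
print(public_names)
print("_helper" in all_names)
```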

In [None]:
# Multi-level inheritance & super()

class Parent:
    def __init__(self):
        print('Im a parent')
        
class Child(Parent):
    def __init__(self):
        Parent.__init__(self)   # must pass self explicitly when calling via the class
        print('Im a child')

# Using super() ?? 
# Has no functional difference to the Child() class above BUT advantages in maintainability & when implementing multiple inheritance

class SuperChild(Parent):
    def __init__(self):
        super().__init__()
        print('Im a super child')
        
# With super() the body of __init__ stays identical; only the parent named in the class statement changes 
class SuperGrandChild(Child):
    def __init__(self):
        super().__init__()
        print('Im a grandchild')
        
# dir() 
dir(package_name.ClassName)
dir(package_name)

# help 
help(my_doc.plot_counts)
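
To check the order in which the chained __init__ calls run, one option is to record each step in a list instead of printing. This is a small sketch with invented class names; __mro__ is the standard attribute that shows the lookup order super() follows.

```python
class Parent:
    def __init__(self):
        self.log = ["parent"]


class Child(Parent):
    def __init__(self):
        super().__init__()          # runs Parent.__init__ first
        self.log.append("child")


class GrandChild(Child):
    def __init__(self):
        super().__init__()          # runs Child.__init__, which runs Parent.__init__
        self.log.append("grandchild")


grandchild = GrandChild()
print(grandchild.log)

# The method resolution order that super() walks through
mro_names = [cls.__name__ for cls in GrandChild.__mro__]
print(mro_names)
```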

In [None]:
# Tweets class inherits from parent class SocialMedia
class Tweets(SocialMedia):
    def __init__(self, text):
        # Call parent's __init__ with super()
        super().__init__(text)
        # Define retweets attribute with non-public method
        self.retweets = self._process_retweets()

    def _process_retweets(self):
        # Filter tweet text to only include retweets
        # help(filter_lines)
        retweet_text = filter_lines(self.text, first_chars='RT')
        # Return retweet_text as a SocialMedia object
        return SocialMedia(retweet_text)


# Docstrings
Anatomy of docstring is as follows: 
- High level description
- Params
- Output 
- Chevrons to show example function call and next line will show expected output 

N.B. Use the colon-delimited fields (:param x:, :return:) so the docstring can be parsed downstream by documentation tools such as Sphinx

In [None]:
def function(x):
    """ High level description of function here 
    
    Additional details on function
    
    :param x: description of param x 
    :return: description of return value i.e. output 
    
    >>> function(2)
    4
       
    """

In [None]:
import re

def tokenize(text, regex=r'[a-zA-Z]+'):
    """Split text into tokens using a regular expression

    :param text: text to be tokenized
    :param regex: regular expression used to match tokens using re.findall 
    :return: a list of resulting tokens

    >>> tokenize('the rain in spain')
    ['the', 'rain', 'in', 'spain']
    """
    return re.findall(regex, text, flags=re.IGNORECASE)

Zen of Python 
- Ensures readability of code 
- view them by typing: 'import this'
- descriptive naming of functions (but not to excess- will make code harder to read, longer to type)

When to refactor your function:
- If function is too long 
- If separate processes happening at same time - a function should do one thing! 
    - Your function is doing too much if it's hard to come up with a meaningful name for it

In [3]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


# Refactoring your function 

- Refactoring longer functions into smaller units can help with both readability and modularity
- The refactored versions below omit docstrings for brevity, but documentation should always be included

In [None]:
import math

def polygon_area(n_sides, side_len):
    """Find the area of a regular polygon

    :param n_sides: number of sides
    :param side_len: length of polygon sides
    :return: area of polygon

    >>> round(polygon_area(4, 5))
    25
    """
    perimeter = n_sides * side_len

    apothem_denominator = 2 * math.tan(math.pi / n_sides)
    apothem = side_len / apothem_denominator

    return perimeter * apothem / 2


#### Refactoring the above, we get: 

def polygon_perimeter(n_sides, side_len):
    return n_sides * side_len

def polygon_apothem(n_sides, side_len):
    denominator = 2 * math.tan(math.pi / n_sides)
    return side_len / denominator

def polygon_area(n_sides, side_len):
    perimeter = polygon_perimeter(n_sides, side_len)
    apothem = polygon_apothem(n_sides, side_len)

    return perimeter * apothem / 2

# Print the area of a hexagon with sides of length 10
print(polygon_area(n_sides=6, side_len=10))
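
A refactor should not change results, so it is worth checking the new functions against the old one. The sketch below keeps the original under the assumed name polygon_area_original and compares the two over a few polygon sizes; the range of sizes is arbitrary.

```python
import math


def polygon_area_original(n_sides, side_len):
    # Single-function version, before refactoring
    perimeter = n_sides * side_len
    apothem = side_len / (2 * math.tan(math.pi / n_sides))
    return perimeter * apothem / 2


def polygon_perimeter(n_sides, side_len):
    return n_sides * side_len


def polygon_apothem(n_sides, side_len):
    return side_len / (2 * math.tan(math.pi / n_sides))


def polygon_area(n_sides, side_len):
    # Refactored version built from the two helpers
    return polygon_perimeter(n_sides, side_len) * polygon_apothem(n_sides, side_len) / 2


# Both versions should agree for every polygon we try
for n_sides in range(3, 9):
    assert math.isclose(polygon_area(n_sides, 10),
                        polygon_area_original(n_sides, 10))

hexagon_area = polygon_area(n_sides=6, side_len=10)
print(round(hexagon_area, 2))
```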

## Unit Testing - Why? 

- Confirms code is working as intended 
- Rather than manual tests you run in the console, should add these to your test_suite.py file 
- Can re-run these tests using the test_suite.py file after making any amendments to the code to ensure no unexpected effects 
- Protects against a change in one part of the code silently breaking other parts that depend on it

Testing in Python - two options are:

1. doctest 
    - use if you're writing full docstrings with examples
    - is a simple way to minimally test your functions
    - i.e. for small use cases like: def multiply(a,b):
    - but if function returns large pandas dataframe, can be hard to include as a doctest
    
2. pytest
    - better handles these larger cases to be tested 
    - recommends a tests directory at the same level as your package's directory 
    - but if developing a larger package with subpackages might be better to break out subpackage tests into their own folder 
    
    How pytest works? 
    - First, searches for files whose names start with 'test_' or end with '_test' e.g. test_suite.py 
    - For these files, it runs all the functions whose names start with 'test' 

In [None]:
import doctest
from collections import Counter

def sum_counters(counters):
    """Aggregate collections.Counter objects by summing counts

    :param counters: list/tuple of counters to sum
    :return: aggregated counters with counts summed

    >>> d1 = text_analyzer.Document('1 2 fizz 4 buzz fizz 7 8')
    >>> d2 = text_analyzer.Document('fizz buzz 11 fizz 13 14')
    >>> sum_counters([d1.word_counts, d2.word_counts])
    Counter({'fizz': 4, 'buzz': 2})
    """
    return sum(counters, Counter())

doctest.testmod()   # output blank if no failed tests
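
doctest.testmod() checks every docstring in the current module. To check a single function and get the pass/fail counts back, one option is to use doctest's DocTestFinder and DocTestRunner directly. A minimal sketch using the multiply example mentioned above (the function body is an assumption):

```python
import doctest


def multiply(a, b):
    """Multiply two numbers.

    >>> multiply(3, 4)
    12
    >>> multiply(2, 0.5)
    1.0
    """
    return a * b


# Collect the examples from multiply's docstring and run them
finder = doctest.DocTestFinder()
runner = doctest.DocTestRunner(verbose=False)
for test in finder.find(multiply, name="multiply", globs={"multiply": multiply}):
    runner.run(test)

# summarize() returns the number of failed and attempted examples
results = runner.summarize(verbose=False)
print(results.failed, results.attempted)
```

If the function body is changed so that an example no longer matches, results.failed becomes non-zero and the runner prints the mismatch.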


In [None]:
# pytest examples:

# workdir/tests/test_suite.py
from collections import Counter
from text_analyzer import Document

def test_document_tokens():
    #created an instance of Document as our test case 
    doc = Document('a e i o u')
    # assert keyword actually runs the test 
    assert doc.tokens == ['a', 'e', 'i', 'o', 'u']
    # If assertion is true, test passes 
    
# Good idea to test for the edge cases too (e.g. expected attributes of a blank document)

def test_document_empty():
    doc = Document('')
    assert doc.tokens == []
    assert doc.word_counts == Counter()

# When testing class objects, should not compare 2 objects with '=='
# (unless the class defines __eq__, '==' only checks whether both names refer to the same object)
doc_a = Document('a e i o u')
doc_b = Document('a e i o u')
print(doc_a == doc_b)   # prints False 

# Instead, should compare with attributes:

print(doc_a.tokens == doc_b.tokens)
print(doc_a.word_counts == doc_b.word_counts)


###### TO RUN OUR TEST ######

# Head to terminal in our 'workdir' directory
# Run the command 'pytest' and wait for our output
$ pytest 

# If we want to only run the tests in one file:
$ pytest tests/test_suite.py   

# IF FAILS, explains why it failed 
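
The '==' pitfall can be reproduced without the course package. In the sketch below, Document is a stripped-down stand-in that splits on whitespace (an assumption made to keep the example short), and the test function follows pytest's naming convention so pytest would collect it.

```python
class Document:
    """Stand-in class with no __eq__ method defined."""

    def __init__(self, text):
        self.text = text
        self.tokens = text.split()


doc_a = Document("a e i o u")
doc_b = Document("a e i o u")

# Two separate objects, so the default '==' (identity) is False
objects_equal = doc_a == doc_b
# Lists compare element by element, so the attributes are equal
tokens_equal = doc_a.tokens == doc_b.tokens
print(objects_equal, tokens_equal)


def test_documents_have_same_tokens():
    # Compare attributes rather than the objects themselves
    assert doc_a.tokens == doc_b.tokens


test_documents_have_same_tokens()
```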

In [None]:
# Another example of pytest

from collections import Counter
from text_analyzer import SocialMedia

# Create an instance of SocialMedia for testing
test_post = 'learning #python & #rstats is awesome! thanks @datacamp!'
sm_post = SocialMedia(test_post)

# Test hashtag counts are created properly
def test_social_media_hashtags():
    expected_hashtag_counts = Counter({'#python': 1, '#rstats': 1})
    assert sm_post.hashtag_counts == expected_hashtag_counts

# Various Software Tools exist to ensure
a) modularity

b) documentation

c) automatic testing

### 1. Sphinx (renders docstrings as html documentation)

- Transforms docstrings into beautiful html documentation
- If managing project with Github / Gitlab:
    - can host Sphinx-built documentation for free using the products' services
    - allows users to search the web to find your documentation
    
- Would use sphinx on the below:

In [2]:
from text_analyzer import Document

class SocialMedia(Document):
    """Analyze text data from social media
    
    :param text: social media text to analyze

    :ivar hashtag_counts: Counter object containing counts of hashtags used in text
    :ivar mention_counts: Counter object containing counts of @mentions used in text
    """
    def __init__(self, text):
        Document.__init__(self, text)
        self.hashtag_counts = self._count_hashtags()
        self.mention_counts = self._count_mentions()

### 2. TravisCI
- CI = Continuous Integration (i.e. you are continually merging new code into your project, with each change checked automatically)
- When changing your code, Travis can automatically run your tests for you and alerts you if your changes broke something 
- Then, when you push a fix, it runs your tests again to double check your fix was successful 
- Feature of TravisCI is 'Scheduled Builds'
    - By scheduling a build, your tests can be run automatically, at pre-defined intervals, without you even adding new code
    - Ensures you catch bugs introduced by updates in one of your dependencies 

### 3. Codecov 

- Lets you explore which parts of your code are being tested by your automatic tests (called test coverage)
- By keeping your project's test coverage high, you will be less prone to surprise bugs 


### 4. Code Climate 

- Tool that analyses your code's readability 
- Indicates when a function is getting too long or even if your code is a little confusing in spots 
- Can point out if your code isn't modular