# Report Writing
This notebook offers some guidelines on how to write a good report and keep your work reproducible in general.

### Table of Contents:
* [Why Should We Write Reports?](#why)
* [How to Communicate Well for Reports](#how)
* [Writing Example](#example)
* [Summary](#summary)

---

## Why Should We Write Reports? <a class="anchor" id="why"></a>
We want to communicate our findings to our *Readers*. So who are our *Readers*?
- Peers who want to learn and understand what you have done
- Individuals who may want to reproduce or expand on what you did
- **Recruiter / Tech Lead** who wants to look through your work quickly and see how well you can document and present your work in a clear and organized way.

Given that these are our main audiences, it is essential to know how to communicate your work as a Data Scientist.


#### What are Some Problems Good Communication can Solve?
- Today, we often mix technical work with higher-level communication
- With new tech, we have more data, and we may end up with more complex code, logic, or workflows
- When we look at just **code**: it is difficult to tell what is happening, given the many layers of thought and logic behind it
- If we just look at **text**: we often lose out on some important information (eg: Bad Journalism)


### What is our Solution?
Our middle ground is to combine complex code and text into an `article`.

![image info](https://i.ibb.co/Pmf2LKk/reportwriting.png)

#### What is an `Article`?
- We want to combine **figures, tables, and numerical outputs** with **code** and **text** to create an `article`.
- Tools such as Jupyter Notebook allow us to combine all of these, using markdown and code cells, into an article
- **Text** explains what is going on, along with any assumptions, caveats, and critical thinking that cannot be communicated through the code alone.
- **Code** involves loading data, performing computations, or generating tables, figures, visuals, etc.
- Good examples are Medium articles with code blocks, research papers for data science, notebooks, even well written blog posts.

## How to communicate well when writing reports / articles: <a class="anchor" id="how"></a>
We are weaving text with the code to **tell a story**. We should include the following:

- `Title, Introduction` - Talk about your motivation and why you are doing this.
- `Methods` - This is the bread and butter of the communication portion for your code heavy questions. Explain what tools or math you have applied for each step. Then justify your methods or walk us through your decision process and state any assumptions or caveats. In DS there is often more than one way to arrive at a solution, so explain to us why you did it this way.
- `Results` - After you have returned the outputs for your question, make sure you communicate your findings. This may include any uncertainty, confirmations, connections to other questions, or answers to something that came up during exploratory analysis. Tie your findings back to the big picture / story.
- `Conclusions` - At the end, talk about any conclusions you can draw from all the analysis you did, as well as any caveats, potential problems, or next steps.
- `DO NOT include every analysis` - We want to present our best work as a story that people can reproduce, not to show off our rough notes.

---

## Writing Example <a class="anchor" id="example"></a>
Let's look at an example given a toy data set. Here we will:
- load in our data
- explore
- analyze
- conclude

We will use the iris dataset to try and classify the different flowers as an example. Here we go ...

#### Introduction
` Here I am talking about my motivation behind this entire project`

In this project we will attempt to accurately classify the different classes of flowers from the `iris` dataset. This is to explore whether we are able to classify new data on these flowers correctly based on the features. As an iris flower enthusiast, I hope these results will help automate the process of categorizing these flowers so I can focus on studying them after they have been identified. We will be using `sklearn` and `K-nearest-neighbors` to train our model. The dataset contains a total of 3 different classes of flowers, which we will be attempting to predict. Let's load the libraries that we will require (including the dataset itself from sklearn):

In [None]:
# set up the libraries
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#### Load and Explore
`Here I am talking about my method and providing context to our reader`

Now let's load in our data. Since `iris` is a built-in `sklearn` dataset, we load it a bit differently than we would read a csv file. `sklearn` returns these 'toy' datasets as a container object (a `Bunch`) that stores the X and Y variables separately as attributes. The attribute `.target` holds our labels, while the attribute `.data` holds the X variables and the bulk of our data. Because we pass `as_frame=True`, these come back as pandas objects (a DataFrame for `.data` and a Series for `.target`) rather than NumPy arrays. We will first load the dataset and then separate the data and labels into two objects.

In [None]:
# load in the data
data = load_iris(as_frame=True)

In [None]:
# target column
Y = data.target

# visualize all the classes
np.unique(Y)

array([0, 1, 2])

From above, we see that we have only 3 unique classes of flowers. Let's take a look at the core of the data now.

In [None]:
# the data without labels
X = data.data
X

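| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |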


`results from our data loading`

Our X variables include the following features:
- sepal length (cm) - length of the sepal
- sepal width (cm) - width of the sepal
- petal length (cm) - length of the petal
- petal width (cm) - width of the petal

Hopefully our model will be able to pick up the differences based on these features between the different classes of flowers.
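
As a rough first look at whether these features differ between classes, one option is to compare the per-class feature means. This is a minimal sketch and not part of the analysis itself; the names `iris` and `class_means` are hypothetical, and it assumes `load_iris(as_frame=True)` so that `.frame` holds the features together with a `target` column.

```python
from sklearn.datasets import load_iris

# as_frame=True returns a container whose .frame has the 4 features plus a "target" column
iris = load_iris(as_frame=True)

# mean of each feature within each class label (0, 1, 2)
class_means = iris.frame.groupby("target").mean()
print(class_means)
```

If the class means for a feature are far apart relative to its spread, that feature is likely to help the model separate the classes.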

`now I am talking about what I will do for data exploration and cleaning`

Now let's check whether our dataset has any missing or NA values. We could check for duplicated rows as well, but since these are measurements of flowers and two flowers could have the same measurements, it may be difficult to tell whether a duplicate is a data error or just part of the data. We can count the duplicates and see whether an unusually high number of them suggests a data error.

In [None]:
# check for NA
X.isna().sum()

sepal length (cm)    0
sepal width (cm)     0
petal length (cm)    0
petal width (cm)     0
dtype: int64

In [None]:
# check for duplicates
X.duplicated().sum()

1

`talking about the results again - stating any assumptions, uncertainty or caveats`

So it seems like we do not have any missing or NA values. We do seem to have one duplicated row; however, since there is only 1 such row, we cannot really tell if it is a data error or if it just so happens that two flowers have the same measurements. In this case, I will assume that this is just part of the data and that it is not a data entry error. Even if it were, it is only 1 row and I believe the error is small enough that it should not matter. For the most part, our data is clean and ready to go. We can now move on to our analysis / modelling.
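
One way to inspect the duplicate more closely is to list every copy of the repeated row along with its label. This is a minimal sketch, assuming the same `load_iris(as_frame=True)` data; `iris`, `features`, and `duplicate_rows` are hypothetical names introduced here.

```python
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
features = iris.data

# keep=False marks every copy of a duplicated row, not only the later repeats
duplicate_rows = features[features.duplicated(keep=False)]
print(duplicate_rows)

# labels of the rows involved, looked up by their index
print(iris.target.loc[duplicate_rows.index])
```

If the copies share the same class label, that is at least consistent with two flowers simply having the same measurements, though it does not rule out a data entry error.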

#### Analyze / Model
`I will talk about thought process and method`

We will train a KNN model to classify our data. I prefer this model over logistic regression here because KNN handles more than 2 classes without modification and can capture non-linear decision boundaries (logistic regression can be extended to multiple classes, but its boundaries are linear). We could also employ other models and compare the results to see which one performs the best. However, for the sake of demo-ing this example, combined with the fact that we have not yet gone into depth on all the other possible models, we will just default to the one we are most familiar with: KNN. If our model fails, we can then do further research, employ more sophisticated models, and explore how they perform. Thus, let's begin by splitting our data into train-test sets so we can train and evaluate our model and make sure it doesn't overfit.

We will split the data 70-30 into train and test sets respectively and maintain the class distribution by stratifying on Y. For reproducibility we will use a `random_state` of 1.

In [None]:
# perform train-test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y, random_state=1)

Let's check to make sure our split was successful by checking the shapes for each set:

In [None]:
print("Training dataset dimensions for X and y:", X_train.shape, y_train.shape)
print("Test dataset dimensions for X and y:", X_test.shape, y_test.shape)

Training dataset dimensions for X and y: (105, 4) (105,)
Test dataset dimensions for X and y: (45, 4) (45,)
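
Beyond the shapes, the stratification itself can be checked by comparing class proportions in the two subsets. This is a minimal sketch that repeats the split with the same settings; `iris`, `train_props`, and `test_props` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, Y = iris.data, iris.target

# same split settings as in the report
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=1
)

# share of each class in each subset; with stratification these should match
train_props = y_train.value_counts(normalize=True).sort_index()
test_props = y_test.value_counts(normalize=True).sort_index()
print(train_props)
print(test_props)
```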


From the dimensions above, it seems like we have a correct train-test split for both X and y. Now we can move on to instantiating and training the model. As a side note, we will not `scale` the data, because all of the values in our dataframe are in the same unit (centimeters). Therefore, I believe there is little benefit to standardizing the scale and we can plug the data into our model as is. As for the `n_neighbors` hyperparameter, we will leave it at the default value of 5. This gives us a baseline for how the model performs out of the box. If we need to improve the performance we can adjust `n_neighbors`; however, if the model performs well enough we can leave it as is, since I do not think it is worth spending extra time and manpower to improve the accuracy by 1-2%.

In [None]:
# perform training
from sklearn.neighbors import KNeighborsClassifier

# instantiate the model
my_knn = KNeighborsClassifier()

# fit the model
my_knn = my_knn.fit(X_train, y_train)

# score the model
train_score_knn = my_knn.score(X_train,y_train)
test_score_knn = my_knn.score(X_test,y_test)

# output the results
print("Train Accuracy: ", train_score_knn)
print("Test Accuracy: ", test_score_knn)


Train Accuracy:  0.9809523809523809
Test Accuracy:  0.9777777777777777


`talk about results - here I explain why we have such a high accuracy, how we can tune this model further but also why we don't really need to at this point`

The train and test accuracies seem to be quite high right out of the box. This is expected as we are working with a very "toy" dataset that is relatively simple and easy for the model to classify. Going back to our point about tuning hyperparameters, I believe this model is already satisfactory and we do not want to overfit the model. Thus, I believe we do not need to perform any further tuning at this point. However, if our model did perform poorly, we could change these parameters in the KNN model:
- n_neighbors - this is our `n` in KNN as it determines how many nearest neighbours to use
- leaf_size - passed to the tree-based neighbour search (BallTree or KDTree); it affects the speed and memory use of building and querying the tree, not which neighbours are found
- p - the power parameter of the Minkowski distance; `p=2` (the default) gives euclidean distance, while `p=1` gives manhattan distance

However, for the most part, we will most likely just be tuning the `n_neighbors` parameter as the other two parameters we listed often can be left at default.
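
If tuning were needed, a small grid search over `n_neighbors` is one common way to do it. The sketch below is an illustration under assumptions, not part of the analysis above: the candidate values and the 5-fold setting are arbitrary choices, and `iris`, `param_grid`, and `search` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X, Y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=1
)

# candidate values are an arbitrary illustrative choice
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 11, 15]}

# 5-fold cross-validation on the training set only, so the test set stays untouched
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("Best n_neighbors:", search.best_params_["n_neighbors"])
print("Mean CV accuracy:", search.best_score_)
print("Test accuracy:", search.score(X_test, y_test))
```

Selecting `n_neighbors` on the training folds and scoring on the test set only once keeps the reported test accuracy an honest estimate.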

#### Conclusion
`conclusion of all the findings`

From our project, we were able to successfully classify different flowers with a test accuracy of ~98%. This is due to the fact that our dataset was very simple and didn't have too many features or complexities. We were able to get away with just using one model and its default parameters to achieve such a high accuracy. However, this does not mean that this model is now ready to classify never-before-seen data from the real world. We have only trained our model on a very small dataset (150 rows) and evaluated it on test cases from the same dataset. If we were to apply this to actual data, the accuracy may begin to fall as we run more examples. Also, since this is a 'toy' dataset, it is possible that these flowers were chosen because they differ clearly in the four features. Datasets in the real world will be much larger and more complex, so this was more of a demo project to see what we can do with sklearn. Nevertheless, we did achieve good results with our toy dataset and we can predict other values within this dataset with fairly high accuracy. Next steps may be to test on real-world examples from outside the iris dataset, or to gather more data and retrain our model on more classes of flowers.
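
One way to probe how much the ~98% figure depends on a single train-test split is k-fold cross-validation. The following is a minimal sketch under assumptions: the choice of 10 folds and the seed are arbitrary, and `iris`, `cv`, and `scores` are hypothetical names.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris(as_frame=True)
X, Y = iris.data, iris.target

# 10 stratified folds; shuffling with a fixed seed keeps the folds reproducible
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)

# one accuracy value per fold, using KNN with its default parameters
scores = cross_val_score(KNeighborsClassifier(), X, Y, cv=cv)

print("Fold accuracies:", scores.round(3))
print("Mean:", scores.mean(), "Std:", scores.std())
```

This still evaluates on the iris data only, so it speaks to the stability of the estimate, not to performance on real-world data.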

## Example Summary <a class="anchor" id="summary"></a>
So that concludes our example. Hopefully this helped you see how you should communicate. Here are some takeaways:

- Have an intro that tells us your motivation, and have a clear question or goal that all of your work ties back into to tell a story
- For every step, talk about your method and decision process. Discuss why you will or will not do something at this step. (eg: why I chose KNN and won't use any other models)
- Guide readers through your process. In between code cells I always add a sentence to guide the readers; don't just rely on comments.
- Discuss the results and talk about them. Try to critically analyze the results and have a conversation about them after every question
- Include a conclusion where you can discuss your findings and also any caveats, potential issues, shortcomings, and next steps to address those. Showing readers you know the caveats demonstrates you don't have tunnel vision and know your stuff.

---