# Coding Assessment Task 2 

## Identifying and Defining

Source: https://www.kaggle.com/datasets/mexwell/us-smoking-trend/data. The data I am analysing shows how much Australians have smoked since 1980. My goal is to compare the number of cigarettes smoked by male and female Australians. From this data we can draw conclusions such as which gender has contributed more to the number of cigarettes smoked per year, and whether smoking by each gender is likely to decrease or increase as the years go on. The data is publicly available and I will access it through a .csv file.


### Definitions: 

Data Loading - the process of copying and loading data or data sets from a source file, folder or application to a database 

Data Cleaning - the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset

Data Analysis - the process of inspecting, cleansing, transforming, and modelling data with the goal of discovering useful information

Data Visualisation - the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualisation tools make trends and patterns in the data easier to see and understand

Data Reporting - collecting and formatting raw information and translating it into a digestible format to assess

### Functional Requirements

## Success Criteria:

#### Data Loading

- Description: load the .csv file into a pandas data frame 
- Input: the file is preloaded
- Output: the program will output the data in the .csv file

#### Data Cleaning

- Description: removal of irrelevant data, e.g. rows for other countries
- Input: the user inputs either a gender (male or female) or a year
- Output: the dataset is filtered by male, female and year, depending on the user's input

#### Data Analysing

- Description: To find the average and percentage population of male and female smokers 
- Input: User requests to see analysed data
- Output: Show the average and percentage

#### Data Visualisation

- Description: To produce a visual representation of the requested data
- Input: User requests to see analysed data
- Output: Show the average and percentage comparison between males and females in a bar graph

#### Data Reporting

- Description: Storing the dataset in a .csv file 
- Input: User will download the file
- Output: .csv file

### Use Cases (example of functional requirements in use)

#### Data Loading

- Actor: programmer
- Goal: To load a dataset into the system.
- Preconditions: User has a dataset file ready.
- Main Flow:
    - Programmer places the dataset file in the correct folder
    - System validates the file format.
    - System loads the dataset and displays the data in a dataframe.
- Postconditions: Dataset is loaded and ready for analysis.
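
The loading flow above can be sketched as follows. This is a minimal sketch rather than the code from the project: the function name `load_dataset`, the column names `Country`, `Year`, `Male` and `Female`, and the sample rows are all assumptions made for illustration, and an in-memory CSV stands in for the real file.

```python
import io
from pathlib import Path

import pandas as pd


def load_dataset(source):
    """Load a CSV into a DataFrame, rejecting file paths that are not .csv."""
    # Only file paths are checked; open file-like objects are passed straight through.
    if isinstance(source, (str, Path)):
        path = Path(source)
        if path.suffix.lower() != ".csv":
            raise ValueError(f"Expected a .csv file, got: {path.name}")
    return pd.read_csv(source)


# Placeholder rows for illustration only, not real figures from the dataset.
sample_csv = io.StringIO(
    "Country,Year,Male,Female\n"
    "Australia,1980,10,8\n"
    "Australia,1981,9,7\n"
    "Canada,1980,11,9\n"
)

df = load_dataset(sample_csv)
print(df.head())
```

With a real file, the call would look like `load_dataset("smoking.csv")`, where the file name is again hypothetical.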

#### Data Cleaning

- Actor: user
- Goal: To filter the dataset down to the desired data
- Preconditions: Dataset is loaded into the system
- Main Flow: 
    - User requests desired data from the interface 
    - System provides options for the user to specify (year)
    - User enters criteria
    - System filters dataset as requested by the user
- Postconditions: The dataset has been filtered and will only display desired information
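
One way the filtering step might look in pandas is shown below. The helper name `filter_dataset`, the column names and the sample values are assumptions for illustration, not taken from the real dataset.

```python
import pandas as pd

# Placeholder rows for illustration only, not real figures from the dataset.
df = pd.DataFrame(
    {
        "Country": ["Australia", "Australia", "Canada"],
        "Year": [1980, 1981, 1980],
        "Male": [10, 9, 11],
        "Female": [8, 7, 9],
    }
)


def filter_dataset(data, country="Australia", year=None):
    """Keep the rows for one country and, optionally, one year."""
    result = data[data["Country"] == country]
    if year is not None:
        result = result[result["Year"] == year]
    # Return a fresh index so the filtered rows are numbered from 0.
    return result.reset_index(drop=True)


australia = filter_dataset(df)
australia_1981 = filter_dataset(df, year=1981)
print(australia_1981)
```

Boolean indexing returns a new DataFrame, so the original `df` is left unchanged and the user can apply a different filter afterwards.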

#### Data Analysis

- Actor: user
- Goal: To turn the requested data into information that can be understood
- Preconditions: unnecessary data has been cleaned
- Main Flow:
    - User inputs what they want to see (e.g. gender, percentage of population)
    - System finds the requested information
    - System prints the requested information
- Postconditions: The results of the data analysis are safely saved into the database
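
As a rough illustration of the analysis step, the sketch below computes an average per gender and each gender's share of the combined total. It assumes hypothetical `Male` and `Female` columns holding smoker counts, and the numbers are placeholders rather than real data.

```python
import pandas as pd

# Placeholder counts for illustration only, not real figures from the dataset.
df = pd.DataFrame(
    {
        "Year": [1980, 1981],
        "Male": [10, 30],
        "Female": [20, 40],
    }
)


def summarise(data):
    """Return the average and percentage share for each gender column."""
    columns = data[["Male", "Female"]]
    averages = columns.mean()
    totals = columns.sum()
    # Each gender's share of the combined total, as a percentage.
    shares = totals / totals.sum() * 100
    return pd.DataFrame({"average": averages, "share_percent": shares})


summary = summarise(df)
print(summary)
```

The result is a small table with one row per gender, which can be printed directly or passed on to the visualisation step.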

#### Generate Data Visualization

- Actor: user
- Goal: To create a visualisation from the dataset (bar graph, pie chart)
- Preconditions: The dataset is loaded and previous use cases are successfully completed
- Main Flow: 
    - User selects the data to be visualised
    - User selects format for the data to be visualised in
    - System displays the desired visualisation
- Postconditions: A desired visualisation of the requested data is displayed
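
A bar graph comparison like the one described could be drawn with matplotlib along these lines. The function name `plot_gender_comparison`, the labels and the values are placeholders chosen for this sketch, not the project's actual code or results.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen so the sketch also runs without a display
import matplotlib.pyplot as plt


def plot_gender_comparison(values, ylabel):
    """Draw one bar per gender and return the figure and axes."""
    fig, ax = plt.subplots()
    ax.bar(list(values.keys()), list(values.values()))
    ax.set_xlabel("Gender")
    ax.set_ylabel(ylabel)
    ax.set_title("Male vs female smokers")
    return fig, ax


# Placeholder values for illustration only, not real figures from the dataset.
averages = {"Male": 20.0, "Female": 30.0}
fig, ax = plot_gender_comparison(averages, "Average (placeholder units)")
```

In a notebook, the `matplotlib.use("Agg")` line can be dropped so the chart displays inline. To keep a copy of the chart, `fig.savefig("comparison.png")` writes it to a file (the file name is hypothetical).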

#### Generate Data Reporting
- Actor: User
- Goal: To store and export the data in a useable format so that the user can access and comprehend the data
- Preconditions: The dataset is loaded and the requested information is accessible
- Main Flow: 
    - The user gets access to the data
    - The user can freely access the data 
    - The user can explore the options of what they want to do with the data
    - Once all actions have been completed, the user exits the program
- Postconditions: The user has found their desired data and has been able to process and understand the data that they were given
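
Exporting the filtered data as a .csv file can be sketched with `DataFrame.to_csv`. The helper name `export_report`, the column names and the rows below are assumptions for illustration. The sketch writes to an in-memory buffer so it runs anywhere, and passing a file path instead would create a file the user can download.

```python
import io

import pandas as pd

# Placeholder rows for illustration only, not real figures from the dataset.
filtered = pd.DataFrame(
    {
        "Country": ["Australia", "Australia"],
        "Year": [1980, 1981],
        "Male": [10, 9],
        "Female": [8, 7],
    }
)


def export_report(data, target):
    """Write the DataFrame as CSV, leaving out the row index."""
    data.to_csv(target, index=False)


buffer = io.StringIO()
export_report(filtered, buffer)
report_text = buffer.getvalue()
print(report_text)
```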

## Non Functional Requirements

#### Usability:
The User Interface is required to be clear and concise, with organised headings, appropriate text size and placement, and an aesthetically pleasing layout. The README document is required to explain essential information about my dataset and user interface. Meeting these requirements guides the user and helps them navigate my database efficiently and with full understanding.

#### Reliability:
The system is required to notify users of errors in the Use Cases and to suggest that they read the README document again. Furthermore, any known errors in the database will be listed in the README document. These requirements are essential for reliability because they make sure that all data is usable.
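
A minimal sketch of this kind of error notification is shown below. The function names `parse_year` and `respond`, the wording of the messages and the year range are assumptions for illustration: 1980 comes from the dataset description, while the upper bound of 2000 is only a placeholder.

```python
def parse_year(text, first_year, last_year):
    """Return the year as an int, or raise ValueError with a helpful message."""
    try:
        year = int(text.strip())
    except ValueError:
        raise ValueError(
            f"'{text}' is not a year. Please read the README for valid input."
        ) from None
    if not first_year <= year <= last_year:
        raise ValueError(
            f"Year must be between {first_year} and {last_year}. "
            "Please read the README for valid input."
        )
    return year


def respond(text):
    """Turn user input into either a result message or an error message."""
    try:
        year = parse_year(text, 1980, 2000)  # placeholder range
    except ValueError as error:
        return f"Error: {error}"
    return f"Showing data for {year}"


print(respond("1995"))
print(respond("abc"))
```

Catching the error in one place means the program keeps running after bad input and the user always sees the same pointer back to the README.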

## Research and Planning

### Research of Chosen Issue

#### Purpose: 
The purpose of this dataset is to inform the user about Australia’s smoking trends since 1980, and to compare which gender contributes more. This is important because it can provide insights into the effectiveness of anti-smoking campaigns and the increased awareness of smoking side effects. Furthermore, it can assist those who are working to reduce smoking in Australia.

#### Missing Data:
It is necessary to acknowledge missing data because it affects the accuracy and reliability of the entire dataset. When examining the dataset, I could not find any missing data or data recorded in the wrong place. However, to better achieve the purpose of the User Interface, I would have preferred the dataset to include annual death rates so that the user could identify trends and patterns more easily.

#### Stakeholders:
The stakeholders who would benefit most from the User Interface are most likely researchers/scientists and healthcare providers, as well as public health officials and the press/media.

#### Use:
The information obtained from this data analysis can be used to understand the trends in cigarette smoking in Australia, which can in turn support studies on the health impacts of smoking and help people make informed decisions to reduce smoking and provide better healthcare to patients. The data can also be used to advocate against smoking and publicise its devastating long-term effects.

### Privacy and Security

#### Data Privacy of Source:
The source of my data is publicly available on a website that anyone can access. The data that a source like this would need to protect is personal information, such as a person's name, date of birth, healthcare providers and possible sickness or death. The dataset does not include personal information; it only provides the raw figures, which are all anonymous.

#### Application of Data Privacy:
My responsibilities in regard to data privacy are very limited because the information in this dataset is already anonymous. However, I am responsible for not distributing inaccurate information that could be proven false.

#### Cyber Security:
To maintain cyber security for a web application, these features should be implemented:

- Proper user authentication ensures that only authorised users can access the web application. Authentication or verification can include usernames and passwords, and multi-factor authentication such as a fingerprint scan or a second password.
- Password hashing converts each password the user enters into a fixed string of characters (a hash), and only the hash is stored in the database. This is a precaution in case the application is hacked: if there is a breach, the attackers obtain the hashes rather than the passwords themselves.
- Encryption converts data into a format that can only be converted back to its normal form with a unique key. It is like scrambling data so that the only way to unscramble it is with a special key.
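
To make the password hashing idea concrete, here is a small sketch using Python's standard library. The function names, the iteration count and the example password are assumptions for illustration, and a real web application would more likely rely on its framework's built-in password handling.

```python
import hashlib
import hmac
import os

ITERATIONS = 200_000  # assumed work factor for this sketch


def hash_password(password, salt=None):
    """Return (salt, digest) for a password using PBKDF2 with SHA-256."""
    if salt is None:
        salt = os.urandom(16)  # a fresh random salt for each password
    digest = hashlib.pbkdf2_hmac(
        "sha256", password.encode("utf-8"), salt, ITERATIONS
    )
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Re-hash the attempt with the stored salt and compare the digests."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


# Only the salt and digest would be stored, never the password itself.
salt, stored_digest = hash_password("example-password")
print(verify_password("example-password", salt, stored_digest))
print(verify_password("wrong-guess", salt, stored_digest))
```

The salt makes identical passwords produce different hashes, and `hmac.compare_digest` compares the digests in constant time.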

#### Data Dictionary

![image.png](attachment:image.png)

## Testing and Evaluating

### Peer Evaluation

Gus - Max constructed a very well working code taking into consideration all of his functional and nonfunctional requirements, and created a UI with a very large amount of well performing functions, especially with the more complex advanced functions he incorporated into it. His code worked effectively, specifically with reliability as I never managed to break the program using it. Max effectively visualised the data into understandable information that was analysed well and presented successfully. In conclusion Max created an effective UI for analysing a serious issue and made it easy to navigate and use.

Luca - Max clearly fulfilled his requirements when it came to his data loading as I was able to load his CSV into the pandas dataframe and it outputted his dataset. His data was also cleaned very well: it only showed the data absolutely necessary for the user to see or data that they wanted to see. However, I think he could benefit from delving further into his data analysis and showing more in-depth information. Aside from that, I think his project was excellent, with the data being visualised well and excellent data reporting too. With his non-functional requirements, I found his project to be easily understood and it was simple to use. The reliability was also satisfactory since it notifies users of errors. Overall I thought it was very informative and helpful, aside from him maybe going a little shallow with the information given.

### Analyse and Conclude

#### Data visualisation:

![image-3.png](attachment:image-3.png)

![image-2.png](attachment:image-2.png)

#### Calculations:
- The main calculation I made was the sum of the total number of smokers, and the result was accurate.
#### Accuracy 
- The information is accurate, and I know it is accurate because I used Python's built-in calculation functions to add up every value in a given column.
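
One simple way to check a column total is to compute it twice with independent methods and compare the results. The column name `Total` and the values below are placeholders for illustration, not figures from the dataset.

```python
import pandas as pd

# Placeholder values for illustration only, not real figures from the dataset.
df = pd.DataFrame({"Year": [1980, 1981, 1982], "Total": [100, 90, 80]})

# Total computed by pandas.
pandas_total = df["Total"].sum()

# The same total computed with Python's built-in sum on a plain list.
manual_total = sum(df["Total"].tolist())

# The two methods should agree if the column was read correctly.
totals_match = pandas_total == manual_total
print(pandas_total, manual_total, totals_match)
```
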
#### Conclusions:
- Some of the conclusions that can be drawn from this data include the fact that the overall number of people who smoked declined over the years. Another conclusion is that there have always been more female smokers than male smokers. This type of data can help those in the medical field, as well as public health officials, as they know which group to target more.

#### Evaluation of Final Product in Relation to Functional and Nonfunctional Requirements

In relation to my nonfunctional requirements for this task, my User Interface worked successfully. The usability of my code was good, and I believe it fulfilled the requirements. For example, the UI was clear, concise and aesthetically pleasing. The information was easy to read and explained what was needed of the user. Furthermore, the reliability of my UI was of a high standard: it notified users when they made mistakes and provided a guide on how to navigate the UI, which ensures that all data is usable. In conclusion, I fulfilled my non-functional requirements for my User Interface, as both usability and reliability worked properly.

In relation to my functional requirements for this task, my User Interface worked successfully. During my data loading phase, I had a problem from the start: I was not able to load the data from my .csv file. This was because, after downloading the data and putting it into my project folder, I did not type the correct code into my main.py file. I fixed this problem by correcting my code, and I was then able to load the data. In relation to my data cleaning, I had no problems. All I had to do was delete all the rows and columns that didn’t include information about Australia. My data analysis also ran smoothly; however, my main problem was that I lacked knowledge of the proper code. For example, I didn’t know how to select columns and rows from my dataset. This was an easy fix: I did some research and read through the Python guides given to me by my teacher. In hindsight, I agree with Luca that my final product would benefit from more advanced data analysis code for more advanced functions. In relation to my data visualisation phase, I did not have any problems. Finally, the data reporting aspect of my project was successful, as I was able to store, organise and output my data without any issues. In conclusion, I fulfilled my functional requirements for my User Interface, as the data loading, data cleaning, data analysis, data visualisation and data reporting were all successful in the end and worked properly.

