# **Identifying and Defining**
---
---
---
## **Learning Intentions:**
---
---
* Specify the functional requirements of a data analysis, including stating the purpose of a solution, describing use cases, and developing test cases of inputs and expected outputs.
* Specify the non-functional requirements of a data analysis.
---
---
## **Success Criteria:**
---
---
* I can clearly state the purpose of a data analysis solution.
* I can describe use cases that outline how the data analysis solution will be used.
* I can develop test cases that include inputs and expected outputs for the data analysis solution.
* I can specify non-functional requirements such as performance, scalability, security, and usability for a data analysis solution.
* I can explain the importance of non-functional requirements in ensuring the overall quality and effectiveness of a data analysis solution.
---
---

## **Choose your Data Scenario and Define your Purpose**
---
---
### **Data**
The data to be analysed is Sydney house prices from 2009 to 2019. It records how many bedrooms, bathrooms and car spaces each property has, along with the type of property and its location (postcode and suburb).

---
### **Goal**
The plan is to analyse the data to get an approximate idea of what drives the price of a house. I plan to ignore the effect of the suburb and instead focus specifically on the number of bedrooms, bathrooms and car spaces. Alternatively, I may try to find the approximate rise in house prices over time.

---
### **Source**
[Source](https://www.kaggle.com/datasets/mihirhalai/sydney-house-prices/data)

---
### **Access**
As the source shows, the data is publicly available to anyone who would like to use it. All you have to do is register on the platform, which is free.

---
### **Access Method**
The data will be accessed as a [.csv](https://support.google.com/google-ads/answer/9004364?hl=en-AU#:~:text=A%20CSV%20comma%2Dseparated%20values,in%20a%20table%20structured%20format.) file.

---
---



## **Functional Requirements**
---
---
### **Data Loading**

* Description: The program must be capable of loading a .csv file (as that is the format the data uses). If there is an error in loading the file, it will likely use a try/except block to give the user a message telling them that their file could not be loaded.
* Input: The user will be asked to confirm that the data has been placed in the correct folder.
* Output: If the file can be loaded, it will be loaded into a pandas DataFrame. If an incorrect format is used or the file is not in the required folder, an error message will appear prompting the user to check the file format and location before trying again.
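
The loading behaviour described above could be sketched roughly as follows. This is only a minimal sketch, not the final implementation: the function name `load_dataset` and the wording of the messages are assumptions made for illustration, and it checks the location and file extension before handing the file to `pd.read_csv`.

```python
from pathlib import Path

import pandas as pd


def load_dataset(path):
    """Return the CSV at `path` as a DataFrame, or None if it cannot be loaded.

    `load_dataset` is a hypothetical helper name used for illustration.
    """
    file_path = Path(path)

    # Check the location and format first so the message can be specific.
    if not file_path.is_file():
        print(f"Could not find '{file_path}'. Check that the file is in the required folder.")
        return None
    if file_path.suffix.lower() != ".csv":
        print(f"'{file_path.name}' is not a .csv file. Check the file format and try again.")
        return None

    try:
        return pd.read_csv(file_path)
    except (pd.errors.EmptyDataError, pd.errors.ParserError, UnicodeDecodeError) as error:
        # The file exists and is named .csv, but its contents are not readable as CSV.
        print(f"'{file_path.name}' could not be read as a CSV file: {error}")
        return None
```

Returning `None` instead of raising lets the calling code prompt the user to try again, which matches the "check the file and retry" behaviour in the Output bullet.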

---

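In [None]:
# Example of such a method:
try:
    # Accepts input such as "4" or "4.0"; float() raises ValueError for non-numeric text.
    var = int(float(input("")))
    print(f"The square of {var} is {var*var}")
except (ValueError, OverflowError):
    # OverflowError covers inputs such as "inf", which cannot be converted to an integer.
    print("Non-numeric input received.")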

---
### **Data Cleaning**

* Description: The program must be capable of handling empty values (changing them to 0) and of filtering the data based on what the user wants.
* Input: The user will be prompted for the values they would like to keep in the data (specific suburbs, dates, etc.).
* Output: The program will report that cleaning is complete and show a pandas DataFrame of the cleaned data.
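
A minimal sketch of this cleaning step is shown here. It assumes the column names `suburb` and `Date` from the dataset, assumes `Date` has already been converted with `pd.to_datetime`, and makes one design assumption of its own: only numeric columns have their empty values replaced with 0, so that text and date columns are not filled with a number. The function name `clean_data` and its parameters are hypothetical.

```python
import pandas as pd


def clean_data(df, suburbs=None, start_date=None, end_date=None):
    """Fill empty numeric values with 0, then keep only the requested rows.

    `clean_data` and its parameters are hypothetical names for illustration.
    """
    cleaned = df.copy()  # leave the original DataFrame untouched

    # Assumption: only numeric columns (e.g. bed, car) are filled with 0.
    numeric_columns = cleaned.select_dtypes(include="number").columns
    cleaned[numeric_columns] = cleaned[numeric_columns].fillna(0)

    # Each filter is optional; None means "keep everything".
    if suburbs is not None:
        cleaned = cleaned[cleaned["suburb"].isin(suburbs)]
    if start_date is not None:
        cleaned = cleaned[cleaned["Date"] >= pd.Timestamp(start_date)]
    if end_date is not None:
        cleaned = cleaned[cleaned["Date"] <= pd.Timestamp(end_date)]

    return cleaned.reset_index(drop=True)
```

For example, `clean_data(Prices_df, suburbs=["Avalon Beach"], start_date="2019-01-01")` would keep only Avalon Beach sales from 2019 onwards.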
 
---
### **Data Analysis**

* Description: The program must be able to calculate common statistical measurements (e.g. mean and median).
* Input: The user will be prompted for which measurements they want.
* Output: The program will output a table with the new measurements.
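
One way this could work is sketched here, using pandas' `DataFrame.agg` to build a table with one row per measurement. The function name `summarise` and the `ALLOWED_MEASURES` list are assumptions made for illustration; the real program may offer a different set of measurements.

```python
import pandas as pd

# Hypothetical list of measurements the user may choose from.
ALLOWED_MEASURES = ("mean", "median", "min", "max", "std")


def summarise(df, columns, measures=("mean", "median")):
    """Return a table with one row per measure and one column per field."""
    unknown = [m for m in measures if m not in ALLOWED_MEASURES]
    if unknown:
        # Reject anything outside the allowed list with a clear message.
        raise ValueError(f"Unsupported measurement(s): {unknown}")
    return df[columns].agg(list(measures))
```

For example, `summarise(Prices_df, ["sellPrice", "bed"])` would return a two-row table holding the mean and median of each column.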

---
### **Data Visualisation**

* Description: The program must be capable of creating a graph suited to its purpose based on user input.
* Input: The user will be prompted for a graph type and will also be given a recommendation.
* Output: The user will be given a matplotlib graph.
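
A rough sketch of the graphing step follows. It assumes the `sellPrice` and `bed` columns from the dataset and offers only two graph types (`"bar"` and `"line"`) as an illustration; the function name `plot_median_price` and the choice of the median as the plotted value are assumptions, not decisions already made for the project.

```python
import matplotlib.pyplot as plt
import pandas as pd


def plot_median_price(df, group_by="bed", kind="bar"):
    """Plot the median sellPrice for each value of `group_by`.

    `plot_median_price` is a hypothetical helper name for illustration.
    """
    if kind not in ("bar", "line"):
        raise ValueError("kind must be 'bar' or 'line'")

    # One median price per distinct value of the grouping column.
    medians = df.groupby(group_by)["sellPrice"].median()

    fig, ax = plt.subplots()
    medians.plot(kind=kind, ax=ax)
    ax.set_xlabel(group_by)
    ax.set_ylabel("Median sell price")
    ax.set_title(f"Median sell price by {group_by}")
    return fig, ax
```

Returning the figure and axes, rather than calling `plt.show()` inside the function, leaves it to the caller to display or save the graph.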

---
### **Data Reporting**

* Description: The program should output a new .csv file if the user wants the data saved.
* Input: The user will be asked whether they would like to save the data.
* Output: The data will be saved to a new .csv file; otherwise it will be discarded.
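
The save step could be sketched as below. The function name `save_if_requested`, the accepted answers (`"y"` and `"yes"`) and the default file name `analysis_output.csv` are all assumptions made for illustration.

```python
import pandas as pd


def save_if_requested(df, answer, out_path="analysis_output.csv"):
    """Write `df` to `out_path` when `answer` is a yes; return True if saved.

    The name, accepted answers and default path are hypothetical.
    """
    if answer.strip().lower() not in ("y", "yes"):
        return False
    # index=False avoids writing the row numbers as an extra unnamed column.
    df.to_csv(out_path, index=False)
    return True
```

In the program this would typically be called with the user's reply, for example `save_if_requested(cleaned_df, input("Save the data? (y/n) "))`, where `cleaned_df` stands for whatever DataFrame is being exported.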

---
---

## **Use Cases**
---
---
**Actor**: The User  
**Goal**: Load the Dataset  
**Preconditions**: The User has the dataset  
**Main Flow**:
1. User places the dataset in the correct folder.
2. The system checks whether the dataset's file format is compatible and whether the dataset is in the right folder. If either is incorrect, an error message will appear specifying the issue.
3. The data is loaded and may be displayed using pandas.

**Post conditions**: Dataset is loaded and is available for analysis.

---
**Actor**: The User  
**Goal**: Fill in the blanks and filter out data that the user does not require.  
**Preconditions**: The dataset has been loaded.  
**Main Flow**:
1. The blanks are filled in with 0s.
2. The user is prompted on what data they would like to analyse.
3. The data is filtered based off this.

**Post conditions**: Dataset has been cleansed to remove unnecessary information.

---
**Actor**: The User  
**Goal**: Add the desired statistical measurements into the data for visualisation.  
**Preconditions**: The data has been cleansed to include only the data required for analysis.  
**Main Flow**:
1. User is prompted about what statistical measurements should be included.
2. The system adds the required measurements to the dataset, with error handling included.  

**Post conditions**: Dataset now contains extra statistical information for ease of analysis.

---
**Actor**: The User  
**Goal**: Visualise the data in the desired format.  
**Preconditions**: Data has been cleansed and whatever extra statistical information desired has been added.  
**Main Flow**:
1. User is prompted for a graph type; a recommendation will be included, depending on the extra statistical information.
2. Matplotlib will create the graph.  

**Post conditions**: Data has been visualised.

---
**Actor**: The User  
**Goal**: Export the .csv file.  
**Preconditions**: Data analysed.  
**Main Flow**:
1. User is asked whether they want the data to be exported.
2. If so, the data is exported.  

**Post conditions**: Data exported if the user decided so.

---
---

## **Non-Functional Requirements**
---
---
### **Usability**
---
Usability is how easy the program is to use and understand. If the system is too complex or too unforgiving for someone who is not very "tech-savvy", it lacks usability. Although high usability is not required for the program to function, it is preferable because it makes the program more accessible. The user interface should be easy to understand and forgiving of errors: it should handle errors gracefully and allow the user to backtrack after an incorrect decision. If the user interface is complex, the README file should explain how to use it. The README file should also explain the file requirements and where the file should be located.

---
### **Reliability**
---
Reliability is how accurate the data is. The data should be accurate; if it is not, the information gained from the analysis will be inaccurate. It is preferable for the system to explain possible errors, although this is not required for the system to function. The system should clearly explain what the error was and how it might be fixed. It should also report the small changes made during the cleansing process, to help explain any possible inaccuracies in the final analysis.

---
## **Bonus**
### **Security**
---
Security is how well protected the data is: the higher the security, the greater the protection from potentially malicious third parties. Although it is not entirely necessary, I will likely include it if everything else is finished (including the potential GUI). The system will likely include a small security system in the form of a username and password.

---
---
---


# **Researching and Planning**
---
---
---
## **Learning Intentions:**
---
---
* Collect and interpret data while adhering to privacy and cybersecurity principles.
* Represent and store data to facilitate computation, including selecting appropriate data types, understanding data type limitations, and structuring data systematically.

---
---
## **Success Criteria:**
---
---
* I can collect data following established privacy and cybersecurity principles to ensure data protection.
* I can interpret data while maintaining compliance with privacy regulations and cybersecurity practices.
* I can select appropriate data types for representing and storing data to facilitate computation.
* I can structure data systematically to support efficient computation and analysis.
---
---
## **Research of Chosen Issue**
---
---
### **Purpose**
---
The purpose of the analysis is to get a decent idea of what drives house prices in Sydney, specifically how things such as car spaces, bathrooms and bedrooms raise the price of a house. This could be useful for further research, as it sets a baseline for comparing how inflated Sydney house prices are, on average, relative to different regions with similar properties.

---
### **Missing Data**
---
The current data lacks more recent sales, as well as statistics such as the average price per suburb, the approximate influence of each factor on the house price, and many other things that may be useful.

---
### **Stakeholders**
---
Home owners, people hoping to buy a home and real estate companies.

---
### **Use**
---
If there is a better idea of how much certain features of a home influence its price, home owners can sell their homes at a more accurate price (or try to raise their profit margin), and people could *hopefully* make better financial decisions when it comes to buying homes *(best financial decision: don't buy one in Sydney)*.

---
---
## **Privacy and Security**
---
---
### **Data Privacy of Source**
---
The data was scraped from [realestate.com.au](https://www.realestate.com.au/legal/privacy-policy). It uses the publicly available information on the website about the type of property, postcode, rooms in the property, etc. Privacy is not a major concern: if a house was listed on the website, the person responsible likely understood that the information would be public *(literally anyone can access it...)*. The source's source (the site) does not have many requirements other than protecting the personal information connected to the site's accounts, and the source itself does not include any information that was not already publicly available.

---
### **Application Data Privacy**
---
I have some small responsibilities for data privacy. All the data used was already publicly available, and none of it directly identifies any given person. Even so, I would have the responsibility to remove data from the program if a user requests it (with sufficient proof that the information relates to them). If this program were released to the public, prior home owners would have the right to have the record of their purchase of the house removed.

---
### **Cyber Security**
---
The program should reasonably have a threat-detection system that alerts users when security has been breached and data may have been leaked. There should also be a method of user authentication, where the program verifies that a person is who they claim to be when they attempt to access the application. This may be done through a simple username and password system, but stored passwords could be leaked in a data breach. This is where password hashing comes in: hashing converts the password into a fixed-length string of numbers and/or letters that cannot practically be reversed, so the original password is not exposed when data is leaked. Hashing is not the same as encryption. Encryption scrambles data using a key, and the data can be unscrambled again with the matching key (in symmetric encryption, the same key), reminiscent of a Vigenère cipher.
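
As an illustration of password hashing, here is a minimal sketch using Python's standard library. It uses salted PBKDF2-HMAC-SHA256 via `hashlib.pbkdf2_hmac`, which is one common choice rather than the method this project has committed to. The function names and the iteration count of 200,000 are assumptions for the example. Only the salt and the digest would be stored, never the password itself.

```python
import hashlib
import hmac
import os


def hash_password(password, salt=None, iterations=200_000):
    """Return (salt, digest) for `password` using PBKDF2-HMAC-SHA256.

    Hypothetical helper; the iteration count is an assumed example value.
    """
    if salt is None:
        # A fresh random salt means identical passwords get different digests.
        salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password, salt, expected_digest, iterations=200_000):
    """Re-hash the attempt with the stored salt and compare the digests."""
    _, digest = hash_password(password, salt, iterations)
    # compare_digest takes the same time whether or not the digests match.
    return hmac.compare_digest(digest, expected_digest)
```

At login the program would look up the stored salt and digest for the username and call `verify_password` with the password the user typed.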

---
---
## **Data Dictionaries**
---
---
### **Data Frame in VS Code**
---

In [2]:
import pandas as pd
Prices_df = pd.read_csv('SydneyHousePrices.csv')
Prices_df['Date'] = pd.to_datetime(Prices_df['Date'])
Prices_df

|      |Date|Id|suburb|postalCode|sellPrice|bed|bath|car|propType|
|------|----|--|------|----------|---------|---|----|---|--------|
|0|2019-06-19|1|Avalon Beach|2107|1210000|4.0|2|2.0|house|
|1|2019-06-13|2|Avalon Beach|2107|2250000|4.0|3|4.0|house|
|2|2019-06-07|3|Whale Beach|2107|2920000|3.0|3|2.0|house|
|3|2019-05-28|4|Avalon Beach|2107|1530000|3.0|1|2.0|house|
|4|2019-05-22|5|Whale Beach|2107|8000000|5.0|4|4.0|house|
|...|...|...|...|...|...|...|...|...|...|
|199499|2014-06-20|199500|Illawong|2234|1900000|5.0|3|7.0|house|
|199500|2014-05-26|199501|Illawong|2234|980000|4.0|3|2.0|house|
|199501|2014-04-17|199502|Alfords Point|2234|850000|4.0|2|2.0|house|
|199502|2013-09-07|199503|Illawong|2234|640000|3.0|2|2.0|townhouse|


---
---
## **My Data Dictionary**
---
---
|Field |Datatype|Format for Display|Description|Example|Validation|
|------|--------|------------------|-----------|-------|----------|
|Date  |datetime64|YYYY-MM-DD|The date the house was sold.|2019-06-19| Must be in the format YYYY-MM-DD; no other order is accepted.|
|Id    |integer|NN...NN|The row number of the house in the dataset; not significant for analysis.|1| Must be a whole number with no decimal places. No limit on the number of digits.|
|suburb|object|XX...XX|The suburb the house is in.|Avalon Beach| Must contain only letters and spaces, with no numbers. No character limit.|
|postalCode|integer|NNNN|The postal code of the area the house is in.|2107| Must be a 4-digit number with no decimal places.|
|sellPrice|integer|NN...NN|The price the house was sold for.|1210000| Must be a whole number with no decimal places. No limit on the number of digits.|
|bed|float64|N.N|The number of bedrooms in the house.|4.0| Must be a number whose decimal part is 0. No limit on the number of digits.|
|bath|integer|N|The number of bathrooms in the house.|2| Must be a whole number with no decimal places. No limit on the number of digits.|
|car|float64|N.N|The number of car spaces at the house.|2.0| Must be a number whose decimal part is 0. No limit on the number of digits.|
|propType|object|XX...XX|The type of building that was sold.|house| Must contain only lowercase letters, with no numbers. No character limit.|
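
The validation rules in the data dictionary could be checked with something like the sketch below. It works on one record held as a plain Python dictionary (values read through pandas may be NumPy types and would need converting first). The function names are hypothetical, and the rules are taken directly from the Validation column, so any value in the real data that breaks those rules would be reported as a problem.

```python
import re
from datetime import datetime


def is_whole_number(value):
    """True for ints and for floats such as 4.0 (booleans are excluded)."""
    if isinstance(value, bool):
        return False
    if isinstance(value, int):
        return True
    return isinstance(value, float) and value.is_integer()


def validate_record(record):
    """Return the names of fields that break the data dictionary rules."""
    problems = []

    # Date: exactly YYYY-MM-DD and a real calendar date.
    date_text = str(record["Date"])
    try:
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", date_text):
            raise ValueError("wrong layout")
        datetime.strptime(date_text, "%Y-%m-%d")
    except ValueError:
        problems.append("Date")

    # Id, sellPrice, bath: integers with no decimal places.
    for field in ("Id", "sellPrice", "bath"):
        value = record[field]
        if not isinstance(value, int) or isinstance(value, bool):
            problems.append(field)

    # postalCode: a 4-digit integer.
    postcode = record["postalCode"]
    if not (isinstance(postcode, int) and 1000 <= postcode <= 9999):
        problems.append("postalCode")

    # bed, car: numbers whose decimal part is 0.
    for field in ("bed", "car"):
        if not is_whole_number(record[field]):
            problems.append(field)

    # suburb: letters and spaces only; propType: lowercase letters only.
    if not re.fullmatch(r"[A-Za-z ]+", str(record["suburb"])):
        problems.append("suburb")
    if not re.fullmatch(r"[a-z]+", str(record["propType"])):
        problems.append("propType")

    return problems
```

An empty list means the record passed every check; otherwise the list names the fields to report back to the user.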