In [None]:
import numpy as np
import pandas as pd
from google.cloud import bigquery

# 1.) List and explain the steps in a typical data science project workflow. 

1. Meet with your employer to understand which questions the project needs to answer.
2. Get or collect the data that will answer those questions.
3. Look at the data for anomalies, and either correct or discard the bad records.
4. Write up, for your employer, which data you are using and which questions you will answer.
5. Meet with your employer to discuss the data and reaffirm the purpose of the project.
6. Fully clean the data so that it can be used for model fitting.
7. Perform a descriptive analysis of the data. This includes statistics such as the mean, median, and mode of each column.
8. Determine whether the problem requires machine learning techniques or can be solved with simple visualisation using plots.
9. If the problem can be solved through simple plotting, create the plots and write a story or analysis explaining why these plots solve the problem.
10. If the problem requires machine learning, decide which ML techniques are appropriate for which portions of the data.
11. Randomly split the data into train and test sets, then fit the chosen model on the training set.
12. Evaluate the model on the test set to see whether it solves the target problem. If not, go back and try a different model.
13. Present to your employer.
14. If your employer is satisfied, your job is done.
15. If your employer is not satisfied, you may need to collect more data or try different models to answer the desired questions.
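
The random split in step 11 can be sketched in a few lines. This is a minimal illustration using only the standard library, not a prescribed method; the helper name `train_test_split_rows`, the 80/20 ratio, and the seed are assumptions chosen for the example.

```python
import random

def train_test_split_rows(rows, test_fraction=0.2, seed=0):
    """Randomly split rows into (train, test) lists (hypothetical helper)."""
    shuffled = list(rows)                  # copy so the caller's data keeps its order
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten observations
train, test = train_test_split_rows(rows)
print(len(train), len(test))
```

Fixing the seed means the same split is produced on every run, which makes it easier to compare different models in step 12.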

# 2.) List at least three common problems that a data scientist might encounter when working with an employer and potential work-arounds for these problems. 

1. The vocabulary a business executive uses is often very different from the vocabulary a data scientist uses, which can lead the data scientist to misunderstand what the employer wants out of a project. This problem is best addressed by scheduling frequent check-in points during a project, at which the data scientist tells the employer what they are doing and why.
2. Sometimes a company will want you to provide insight from data, but either you will not have enough data or the data will not support the insight the employer wants. The best way to avoid this problem is to determine early in the process whether you have enough of the right data.
3. Sometimes a company will want to use data in a way that the data scientist considers unethical. For example, a college may want a professor to analyze how students should be placed in math courses, and the predictors in such a dataset could include gender and race. Placing a student in a math class based on either of those predictors raises an ethical problem. If this happens, it is best for the data scientist to say early on that they will not use that data, and to come prepared to defend their position.

# 3.) What is the major difference between a database and a flat file and why are databases more useful in many situations?

The major difference between a database and a flat file is that a flat file contains a single table of data, whereas a database can hold many tables and the relationships between them.

Databases excel at keeping company data organized. Many kinds of data can't be represented well in just one table. For example, if a company wanted to track both who was buying a product and detailed descriptions of the products, it would be very difficult to keep all of that in one well-organized table. You'd probably want a table of product data, a table of customer data, and a table of product sale data. Each of these tables would then have primary and foreign keys linking them together.
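
The product, customer, and sales tables described above can be sketched with Python's built-in `sqlite3` module. This is only an illustration under assumed names: the table names, column names, and rows are all invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, description TEXT);
    -- each sale points at one customer and one product through foreign keys
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customers(customer_id),
        product_id  INTEGER REFERENCES products(product_id)
    );
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO products  VALUES (10, 'Blue widget'), (20, 'Red widget');
    INSERT INTO sales     VALUES (100, 1, 10), (101, 1, 20), (102, 2, 10);
""")

# Join the three tables to recover who bought what
rows = conn.execute("""
    SELECT c.name, p.description
    FROM sales AS s
    JOIN customers AS c ON c.customer_id = s.customer_id
    JOIN products  AS p ON p.product_id  = s.product_id
    ORDER BY s.sale_id
""").fetchall()
print(rows)
conn.close()
```

Each customer name and product description is stored once, and the sales table only stores keys. A single flat file would have to repeat the name and description on every sale row.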

# 4). a). Start by reading in the dataset

In [None]:
client = bigquery.Client()
# Reference to the san_francisco dataset
dataset_ref = client.dataset("san_francisco", project = 'bigquery-public-data')
# Fetch the dataset through the reference
dataset = client.get_dataset(dataset_ref)


# b.) Produce a list of the tables within this dataset. 

In [None]:
# Stores dataset tables into tables
tables = list(client.list_tables(dataset))
# for loop listing table names
for table in tables:
    print(table.table_id)

# c.) Determine the size of data required to import the film_locations table without actually loading it.

In [None]:
# Get all data in film_locations
query1 = """
        SELECT *
        FROM `bigquery-public-data.san_francisco.film_locations`
"""
# A dry run estimates the bytes the query would process without running it
dry_run_config = bigquery.QueryJobConfig(dry_run = True)
dry_run_query_job = client.query(query1, job_config = dry_run_config)


print("This query will process {} bytes.".format(dry_run_query_job.total_bytes_processed))
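
The dry run reports a raw byte count, which is awkward to compare against a byte cap by eye. A small formatting helper can make it easier to read. This is a sketch only: `format_bytes` is a hypothetical helper, not part of the BigQuery client, and it uses decimal units (1 MB = 1000 * 1000 bytes) to match the way the byte caps are computed later in this notebook.

```python
def format_bytes(num_bytes):
    """Format a byte count with decimal (SI) units; hypothetical helper."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        # stop at the first unit that keeps the number below 1000
        if size < 1000 or unit == "TB":
            return f"{size:.2f} {unit}"
        size /= 1000

print(format_bytes(1_500_000))
```

With a real dry run, the call would be `format_bytes(dry_run_query_job.total_bytes_processed)`.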

# d.) Read this table and produce a data frame that counts the number of movies produced by each production company from before 1950 and listed from most movies to least. 

In [None]:
#  READING THE TABLE
# Create a reference to the film_locations table
table_ref_film = dataset_ref.table('film_locations')
# Fetch the table (a BigQuery Table object, not yet a data frame)
dfFilm = client.get_table(table_ref_film)

client.list_rows(dfFilm, max_results = 6).to_dataframe()

In [None]:
#  PRODUCING THE DATAFRAME
query2 = """
        SELECT 
            production_company,
            -- the table appears to hold one row per filming location,
            -- so count distinct titles to count movies
            COUNT(DISTINCT title) AS num_movies
        FROM `bigquery-public-data.san_francisco.film_locations`
        WHERE release_year < 1950
        GROUP BY production_company
        ORDER BY num_movies DESC
"""
# Cap the number of bytes the query may bill
ONE_HUNDRED_MB = 1000*1000*100  # one hundred megabytes
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed = ONE_HUNDRED_MB)
# Run the query and load the result into a data frame
film_counts_job = client.query(query2, job_config = safe_config)
film_counts_job.to_dataframe()
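
One caveat when counting movies: if `film_locations` holds one row per filming location rather than one row per film (which appears to be the case, but is an assumption here), then counting rows per company counts locations, not movies. The toy example below uses `sqlite3` with invented titles and company names to show the difference between `COUNT(column)` and `COUNT(DISTINCT title)`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE film_locations (title TEXT, production_company TEXT, locations TEXT)"
)
# Invented rows: Film A was shot at three locations, so it appears three times
conn.executemany(
    "INSERT INTO film_locations VALUES (?, ?, ?)",
    [
        ("Film A", "Studio X", "Pier 39"),
        ("Film A", "Studio X", "City Hall"),
        ("Film A", "Studio X", "Coit Tower"),
        ("Film B", "Studio Y", "Golden Gate Park"),
        ("Film C", "Studio Y", "Union Square"),
    ],
)
counts = conn.execute("""
    SELECT production_company,
           COUNT(production_company) AS num_rows,
           COUNT(DISTINCT title)     AS num_movies
    FROM film_locations
    GROUP BY production_company
    ORDER BY num_movies DESC, production_company
""").fetchall()
print(counts)
conn.close()
```

Here Studio X has more rows but fewer movies than Studio Y, so the two counts would rank the companies differently.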

# e.) From the table sfpd_incidents, produce a table of the most common police incidents in the category of LARCENY/THEFT in the year 2016 ordered from most to least common. 

In [None]:
#  READING THE TABLE
# Create a reference to the sfpd_incidents table
table_ref_incidents = dataset_ref.table('sfpd_incidents')
# Fetch the table (a BigQuery Table object, not yet a data frame)
dfIncidents = client.get_table(table_ref_incidents)

client.list_rows(dfIncidents, max_results = 6).to_dataframe()

In [None]:
#  PRODUCING THE DATAFRAME
query2 = """
        SELECT 
            descript,
            COUNT(descript) AS num_occurrences
        FROM `bigquery-public-data.san_francisco.sfpd_incidents`
        WHERE EXTRACT(YEAR FROM timestamp) = 2016
        AND category = "LARCENY/THEFT"
        GROUP BY descript
        ORDER BY num_occurrences DESC
"""

# Cap the number of bytes the query may bill
ONE_THOUSAND_MB = 1000*1000*1000  # one thousand megabytes
safe_config = bigquery.QueryJobConfig(maximum_bytes_billed = ONE_THOUSAND_MB)
# Run the query and load the result into a data frame
larceny_counts_job = client.query(query2, job_config = safe_config)
larceny_counts_job.to_dataframe()

# 5). a.) Using the Kaggle dataset Melbourne_housing_FULL.csv, start by reading and describing this dataset. 

In [None]:
df = pd.read_csv('../input/melbourne-houses/Melbourne_housing_FULL.csv')
df.describe()

In [None]:
# See number of rows and columns
df.shape

# b.) Using the iloc method, produce the last 10 rows of the dataset. 

In [None]:
df.iloc[-10:,:]

# c.) There are columns with missing data in this dataset. Clean the dataset by removing all rows with missing values. (This is a BAD idea, but it is where we start). 

In [None]:
df = df.dropna()

# d.) Using python code, produce rows where the Suburb is in Albert Park and the house has 3 or more bedrooms and sold in 2017. Print the 5 observations with lowest price where the price is not null. 

In [None]:
# This dataset does not describe what each column is,
# so I decided to interpret the column
# Bedroom2 as the number of bedrooms

In [None]:
# Checking the date datatype
df.Date.dtype

In [None]:
# The Date column is stored as a string rather than a timestamp,
# so split it into day, month and year (the dates appear to be day/month/year)

df[['Day','Month','Year']] = df['Date'].str.split('/',expand=True)
df.head()
# Day, Month and Year are at the end of the output
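
An alternative to splitting the string is to parse it as a real date, which also catches malformed values. The sketch below assumes the dates are written day/month/year (for example `3/09/2016`); the helper name `parse_sale_date` and the sample strings are invented for illustration.

```python
from datetime import datetime

def parse_sale_date(text):
    """Parse a 'day/month/year' string into a date (hypothetical helper)."""
    return datetime.strptime(text, "%d/%m/%Y").date()

samples = ["3/09/2016", "27/05/2017"]  # made-up examples in the assumed format
years = [parse_sale_date(s).year for s in samples]
print(years)
```

In pandas, the same idea would be `pd.to_datetime(df['Date'], dayfirst=True)` followed by `.dt.year`, which gives an integer year rather than the string `'2017'`.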

In [None]:
# Create new dataframe that only has the Albert Park Suburb
df2 = df[df['Suburb'] == 'Albert Park']
# Take the new df and create another df that looks at houses that have 3 or more bedrooms
df3 = df2.loc[df2['Bedroom2'] >= 3]
# Do the same as above but set Year = 2017
df4 = df3[df3['Year'] == '2017']


In [None]:
# Sort by price
df4 = df4.sort_values('Price', ascending = True)
df4.head(5)

# e.) Produce a list of all the unique Sellers.

In [None]:
df.SellerG.unique()