<a href="https://colab.research.google.com/github/vlx300/kb_colab/blob/master/Downloading_and_Exploring_Data_in_Python.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Downloading and Exploring Data in Python**

![alt text](https://www.prospecta.com/images/data-science-banner.jpg)

In this section, we will download data sets from online sources and create working visualizations of that data. The ability to analyze data allows you to discover patterns that no one knew were there. We will access and visualize data in two common data formats (**CSV and JSON**). We will use Python's csv module to process weather data stored in CSV format, then use Matplotlib to generate a chart from the downloaded data.

We will also use the json module to access data stored in JSON format and use Pygal to draw a population map by country.

#**Mount Google Drive**

In [0]:
from google.colab import drive  # mount Google Drive #
drive.mount('/content/drive')

##**Append sys.path for Google Drive**

In [0]:
import sys
sys.path.append("/content/drive/My Drive")

##**CSV File Format**

One simple way to store data in a text file is to write the data as a series of **values separated by commas**, known as comma-separated values (CSV).

For this exercise, we will use weather data for **July 2014 in Sitka, Alaska**. It includes each day's high and low temperatures as well as a number of other measurements. CSV files can be a bit tricky for humans to read, but they are easy for programs to process and extract values from, which speeds up data analysis greatly.

The weather data was originally downloaded from https://www.wunderground.com/history
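To make the CSV format concrete, here is a minimal sketch that splits two sample lines into their values. The lines and their values are made up for illustration and are not taken from the Sitka file. The second line shows why a CSV parser is usually safer than a plain `split(",")`: a quoted field may itself contain a comma.

```python
import csv

# Illustrative lines only; these values are not from the Sitka file.
sample_lines = [
    "2014-7-1,64,56,50",
    '"Sitka, Alaska",64',
]

# csv.reader accepts any iterable of strings, so a list of lines works here.
parsed = list(csv.reader(sample_lines))

for fields in parsed:
    print(fields)
```

Each parsed line is a list of strings, and the quoted `Sitka, Alaska` stays together as a single value.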

###**Parsing CSV file Headers**

Python's csv module in the standard library parses the lines in a CSV file and allows us to quickly extract the values we're interested in. Let's examine the first line of the file, which normally contains a series of headers for the data.

In [0]:
import csv   # import the csv module #

filename = "/content/drive/My Drive/sitka_weather_07-2014.csv"  # store the name of the file in filename #
with open(filename) as f:  # open the file and store the resulting file object in f #
  reader = csv.reader(f)  # pass the file object (f) to csv.reader() to create the reader object associated with that file #
  header_row = next(reader)  # the built-in next() function returns the next line from the reader object; here, that is the header row #
  print(header_row)

  # the output shows the headers, which describe the weather-related information each line of data holds #


*reader* processes the first line of comma-separated values in the file and stores each value as an item in a list. The header 'AKDT' stands for **Alaska Daylight Time**. The **position** of this header indicates that the first value in each line will be a date or time.

The '***Max TemperatureF***' header tells us that the second value in each line is the maximum temperature for that date, in degrees Fahrenheit. You can read through the rest of the headers to determine the kind of information included in the file.

**Note**: the headers are not always formatted consistently. Spaces and units are in odd places. This is common in raw data files, but it won't cause a problem here because we access the columns by position rather than by name.
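If you ever want to match headers by name instead of position, stray spaces can get in the way. As a rough sketch, the headers can be tidied with `str.strip()`. The header names below are hypothetical, written to mimic the odd spacing described above rather than copied from the Sitka file.

```python
# Hypothetical header names that mimic the inconsistent spacing described above.
raw_headers = ["AKDT", "Max TemperatureF", " Mean Humidity", " Events"]

# strip() removes leading and trailing whitespace but leaves inner spaces alone.
clean_headers = [header.strip() for header in raw_headers]

print(clean_headers)
```

This step is optional for the rest of this section, which only uses column positions.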

###**Printing the headers and their positions**

To make the file header data easier to understand, print each header and its position in the list.

In [0]:
import csv   

filename = "/content/drive/My Drive/sitka_weather_07-2014.csv"  
with open(filename) as f:
  reader = csv.reader(f)
  header_row = next(reader)

  for index, column_header in enumerate(header_row):  # enumerate() gives the index of each item in a list as well as its value #
    print(index, column_header)
# the results show that the dates and their high temperatures are stored in columns 0 and 1 #
# to explore this data, we will process each row and extract the values at indices 0 and 1 #
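As an alternative to remembering the numbers 0 and 1, the `enumerate()` idea above can be used to build a lookup from header name to position. This is only a sketch: it reads a small in-memory stand-in instead of the file on Google Drive, the two header names come from the text above, and the data rows are invented.

```python
import csv
import io

# Stand-in for the weather file so the example runs without Google Drive.
# The header names come from the text above; the two data rows are made up.
sample_csv = "AKDT,Max TemperatureF\n2014-7-1,64\n2014-7-2,71\n"

with io.StringIO(sample_csv) as f:
    reader = csv.reader(f)
    header_row = next(reader)

    # Map each header name to its position in the row.
    column_index = {name: index for index, name in enumerate(header_row)}

    # Look the column up by name instead of hard-coding the index.
    highs = [row[column_index["Max TemperatureF"]] for row in reader]

print(column_index)
print(highs)
```

The values in `highs` are still strings at this point, exactly as they would be when read from the real file.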

##**Extracting and Reading data**

Now that we know which columns of data we need, let's read in some of that data. First, we'll read in the high temperature for each day.

In [0]:
import csv

# get high temperatures from file #
filename = "/content/drive/My Drive/sitka_weather_07-2014.csv"  
with open(filename) as f:
  reader = csv.reader(f)
  header_row = next(reader)

  highs = []  # make an empty list called highs #
  for row in reader:  # loop through the remaining rows in the file #
    highs.append(row[1])

  print(highs)

# the reader object continues from where it left off in the CSV file and automatically returns each line following its current position #
# because we already read the header row above, the loop begins at the second line, where the actual data begins #
# the output shows the high temperature for each date stored in a list as STRINGS (' '); these need to be converted to numbers #

In [0]:
import csv

filename = "/content/drive/My Drive/sitka_weather_07-2014.csv"  
with open(filename) as f:
  reader = csv.reader(f)
  header_row = next(reader)

  highs = []
  for row in reader:
    high = int(row[1])  # convert strings to numbers #
    highs.append(high)

  print(highs)
  # the output now shows numbers rather than strings #
  # now we're ready to visualize the data #
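The conversion above assumes every row holds a usable number. Raw data files sometimes contain empty fields, and `int('')` raises a `ValueError`. The sketch below shows one possible way to skip such rows. The helper name `parse_highs` and the sample rows are hypothetical and are not part of the original notebook.

```python
def parse_highs(rows, column=1):
    """Return integer highs, skipping rows whose value is not a whole number."""
    highs = []
    for row in rows:
        try:
            highs.append(int(row[column]))
        except ValueError:
            # An empty or non-numeric field ends up here.
            print("Skipping row with unusable value:", row)
    return highs


# Made-up rows; the second one has an empty temperature field.
sample_rows = [["2014-7-1", "64"], ["2014-7-2", ""], ["2014-7-3", "71"]]
result = parse_highs(sample_rows)
print(result)
```

Skipping is only one option. Depending on the analysis, you might prefer to stop with an error or record the gap explicitly.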

##**Plotting data in a Temperature chart** 

To visualize the temperature data we have, let's first create a simple plot of the daily highs using Matplotlib.

In [0]:
import csv
from matplotlib import pyplot as plt

# get high temperatures from file #
filename = "/content/drive/My Drive/sitka_weather_07-2014.csv"  
with open(filename) as f:
  reader = csv.reader(f) 
  header_row = next(reader) 
  
  highs = []  
  for row in reader:  
    high = int(row[1])  
    highs.append(high)

# plot data #
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.plot(highs, c='red')
# format plot #
plt.title("Daily high temperatures, July 2014", fontsize=24)
plt.xlabel('', fontsize=16)
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)

plt.show()

##**Plotting dates** 

We can improve our plot of the temperature data by extracting the dates for our daily highs and passing both the dates and the highs to plot().
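Before reading dates from the file, it may help to see the conversion on its own. The sketch below parses one date string with `datetime.strptime()`. The string is illustrative and assumes the first column uses a year-month-day layout.

```python
from datetime import datetime

# Illustrative date string, assuming a year-month-day layout in the first column.
first_date = datetime.strptime("2014-7-1", "%Y-%m-%d")

print(first_date)
print(first_date.year, first_date.month, first_date.day)
```

In the format string, `%Y` matches a four-digit year, `%m` the month number, and `%d` the day of the month. The result is a `datetime` object, which Matplotlib can place on a date axis.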

In [0]:
import csv
from datetime import datetime  # import the datetime class from the datetime module #
from matplotlib import pyplot as plt

# get high temperatures from file #
filename = "/content/drive/My Drive/sitka_weather_07-2014.csv"  
with open(filename) as f:
  reader = csv.reader(f)
  header_row = next(reader)

  dates, highs = [], []  # added an empty list for dates #
  for row in reader:
    current_date = datetime.strptime(row[0], "%Y-%m-%d")  # for a 4-digit year, make sure %Y is capitalized #
    dates.append(current_date)
    high = int(row[1])  
    highs.append(high)

# plot data #
fig = plt.figure(dpi=128, figsize=(10, 6))
plt.plot(dates, highs, c='red')  # add dates to plot #

# format plot #
plt.title("Daily high temperatures, July 2014", fontsize=24)
plt.xlabel('', fontsize=16)
fig.autofmt_xdate()  # draws the date labels diagonally to prevent them from overlapping #
plt.ylabel("Temperature (F)", fontsize=16)
plt.tick_params(axis='both', which='major', labelsize=16)

plt.show()