## Python Data Analysis Basics

We will learn how to:

- Calculate how old the artist was when they created their artwork.
- Analyze and interpret the distribution of artist ages.
- Create functions which summarize our data.
- Print summaries in an easy-to-read way.

DQ has provided a dataset called '_artworks_clean.csv_'. We will analyze a modified version of it.

__Update:__ I have successfully fixed the 'Date' column using the Python library 'regex'.

In [1]:
from csv import reader

# Read the `artworks_clean.csv` file; the `with` block closes it when done
with open('data/artworks_clean.csv') as opened_file:
    read_file = reader(opened_file)
    moma = list(read_file)
moma_header = moma[0]
moma_data = moma[1:]

# Convert the birthdate values
for row in moma_data:
    birth_date = row[3]  # The Index is 3 in our dataset
    if birth_date != "":
        birth_date = int(birth_date)
    row[3] = birth_date
    
# Convert the death date values
for row in moma_data:
    death_date = row[4]  # The Index is 4 in our dataset
    if death_date != "":
        death_date = int(death_date)
    row[4] = death_date

In [2]:
# Convert the date column values
for row in moma_data:
    date = row[6]   # The Index is 6 in our dataset
    if date != "":
        date = int(date)
    row[6] = date
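
The three conversion loops above differ only in the column index, so they could be folded into a single helper. The following is a minimal sketch, not part of the original analysis: the function name `convert_column` and the two sample rows are made up for illustration, and only the column positions (3, 4 and 6) come from the cells above.

```python
def convert_column(rows, index):
    """Convert non-empty string values in one column to int, in place."""
    for row in rows:
        value = row[index]
        if value != "":
            row[index] = int(value)


# Made-up sample rows; only the positions of the birth date (3),
# death date (4) and date (6) columns mirror the real data set.
sample_rows = [
    ["Title A", "Artist A", "", "1898", "1970", "Male", "1968"],
    ["Title B", "Artist B", "", "", "", "Female", "1931"],
]

for index in (3, 4, 6):
    convert_column(sample_rows, index)

print(sample_rows[0][3], sample_rows[0][4], sample_rows[0][6])
print(repr(sample_rows[1][3]))  # empty strings are left untouched
```

Empty strings stay as they are, which matches the behavior of the original loops.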

### Calculating Artist Ages

We're going to work on calculating the ages at which artists created their pieces of art. We need to subtract the artist's birth year (BeginDate) from the year in which their artwork was created (Date).

While every row has a value for Date, some rows are missing a value for BeginDate. When we cleaned BeginDate, we left these missing values as empty strings (""). We'll assign an age of 0 in these cases, which we'll replace with something more meaningful later on.

There are a handful of cases where the artist's age (according to our data set) is very low, including some where the age is negative. We could investigate these cases one by one, but since we're looking for a summary, we'll categorize artists aged 20 or younger as "Unknown". This has the handy effect of also categorizing the artists without birth years, whose age we set to 0, as "Unknown".

| Year Artwork Created (date) | Birth Year (birth) | age | final_age |
| --------------------------- | ------------------ | --- | --------- |
| 1968 | 1898 | 70 | 70        |
| 1931 |  ""  | 0  | "Unknown" |
| 1972 | 1976 | -4 | "Unknown" |

In [3]:
ages = []

for row in moma_data:
    date = row[6] # Index 6
    birth = row[3] # Index 3
    if isinstance(birth, int):
        if date == 0:
            age = 0
        else:
            age = date - birth
    else:
        age = 0
    ages.append(age)
    
final_ages = []

for age in ages:
    if age > 20:
        final_age = age
    else:
        final_age = "Unknown"
    final_ages.append(final_age)
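
The two loops above can also be read as one rule applied per row. The sketch below is an illustration rather than the notebook's own code: the function name `artist_age` and its `min_age` parameter are hypothetical. It uses the same cut-off as the cell above (an age must be greater than 20 to be kept) and checks the three rows from the table.

```python
def artist_age(date, birth, min_age=20):
    """Return the artist's age at creation, or "Unknown" if it is unusable."""
    # A missing birth year or date is stored as "", so it is not an int
    if not isinstance(birth, int) or not isinstance(date, int):
        return "Unknown"
    age = date - birth
    # Same cut-off as the cell above: ages of min_age or below are discarded
    return age if age > min_age else "Unknown"


# The three example rows from the table: (date, birth)
examples = [(1968, 1898), (1931, ""), (1972, 1976)]

for date, birth in examples:
    print(date, repr(birth), artist_age(date, birth))
```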

### Converting Age to Decades

We now have a list, final_ages, containing each artist's age when they produced the artwork (or "Unknown"). Because there are many unique ages, we'll record only the decade of the artist's life in which each work was created. For instance, if we calculate that the artist was 24, we'll record that as the artist being in their "20s."

As a first step toward this, we'll need to remove the last digit in every age:

- 24 will become 2
- 86 will become 8
- 50 will become 5
- 106 will become 10

In [4]:
decades = []

for age in final_ages:
    if age == "Unknown":  # compare values with ==, not identity with `is`
        decade = age
    else:
        decade = str(age)
        decade = decade[:-1] + '0s'
    decades.append(decade)
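
Dropping the last digit with string slicing is one way to get the decade. Integer floor division is another. The sketch below shows that alternative under the assumption that every age is either an int or the string "Unknown"; the function name `age_to_decade` is made up for illustration.

```python
def age_to_decade(age):
    """Map an age such as 24 to a label such as "20s"."""
    if age == "Unknown":
        return age
    # 24 // 10 is 2, and 2 * 10 is 20, so 24 becomes "20s"
    return "{}s".format(age // 10 * 10)


for age in [24, 86, 50, 106, "Unknown"]:
    print(age, "->", age_to_decade(age))
```

For the four example ages listed earlier, this gives "20s", "80s", "50s" and "100s", the same labels the slicing approach produces.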

### Summarizing the Decade Data

The last step of our analysis is to count how many instances of each decade there are. To do this, we're going to use a technique from the Python Fundamentals course: constructing a __frequency table__.



In [5]:
decade_frequency = {}

for decade in decades:
    if decade not in decade_frequency:
        decade_frequency[decade] = 1
    else:
        decade_frequency[decade] += 1
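
For comparison, the standard library's `collections.Counter` builds the same kind of frequency table in a single call. This is a swapped-in alternative to the manual loop, not the technique the course uses, and the `sample_decades` list is made-up data for illustration only.

```python
from collections import Counter

# Made-up sample values, not taken from the real data set
sample_decades = ["30s", "20s", "30s", "Unknown", "40s", "30s"]

sample_frequency = Counter(sample_decades)

print(sample_frequency["30s"])
print(dict(sample_frequency))
```

A `Counter` behaves like a dictionary, so it can be looped over and indexed in the same way as the hand-built table.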

### Inserting Variables Into Strings

The str.format() method is a powerful tool that helps us write easy-to-read code while combining strings with other variables.

There are also extra things that str.format() can do with formatting numbers, but for now we'll focus on inserting values into strings.

We use the method with a string — which acts as a template — using the brace characters ({}) to signify where we want any variables to be inserted. We then pass those variables as arguments to the method.

- The variables are inserted into the {} in the order that we pass them as arguments.
- If we want to specify the ordering and/or repeat values, we can put integer positions inside the braces, such as {0} and {1}.
- Lastly, if we want to make things clearer, we can give each variable a name. This is technically called using keyword arguments, which you may remember from learning about functions.

In [6]:
artist = "Pablo Picasso"
birth_year = 1881

result = '{}\'s birth year is {}'.format(artist, birth_year)
print(result)

Pablo Picasso's birth year is 1881
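
The cell above shows only the first of the three styles. As a short sketch, the other two (integer positions and keyword arguments) might look like this; the template wording and the names `name` and `year` are chosen for illustration.

```python
artist = "Pablo Picasso"
birth_year = 1881

# Integer positions let us reorder and repeat arguments
positional = "Born in {1}: {0}. Artist name again: {0}".format(artist, birth_year)

# Keyword arguments give each placeholder a readable name
keyword = "{name}'s birth year is {year}".format(name=artist, year=birth_year)

print(positional)
print(keyword)
```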


### Creating an Artist Frequency Table

We can use the same technique we used to create our earlier frequency table with one minor modification: we will be iterating over a list of lists instead of the simple list we used for our decades frequency table.

The only difference this makes is that we will first need to extract the value we want to count from the row before we start.

In [7]:
artist_freq = {}

for row in moma_data:
    artist = row[1] # Index 1 for Artist column
    if artist not in artist_freq:
        artist_freq[artist] = 1
    else:
        artist_freq[artist] += 1

#### Creating an Artist Summary Function

Now that we've created a dictionary containing the counts of each artist's artworks in our data set, the final part of our task will be creating a function that displays information for a specific artist.

Our function will take a single argument — the name of an artist — and will display a formatted sentence about that artist. The diagram below illustrates the input and output.

Inside the function, we'll need to:

- Retrieve the number of artworks by the artist from the artist_freq dictionary.
- Define a template for our output.
- Use str.format() to insert the artist's name and number of artworks into our template.
- Use the print() function to display the output.

In [8]:
def artist_summary(artist):
    n_artworks = artist_freq[artist]
    output = 'There are {1} artworks by {0} in the data set'.format(artist, n_artworks)
    print(output)

artist_summary('Henri Matisse')

There are 1063 artworks by Henri Matisse in the data set
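
One thing to be aware of is that `artist_freq[artist]` raises a `KeyError` if the name is not in the dictionary. A possible variation, sketched below, uses `dict.get()` with a default of 0 instead. The function name `artist_summary_safe`, the one-entry `artist_freq_sample` dictionary, and the choice to return the string rather than print it are all assumptions made for this example.

```python
# Hypothetical stand-in for the real artist_freq dictionary
artist_freq_sample = {"Henri Matisse": 1063}


def artist_summary_safe(artist, freq):
    """Build the summary sentence, treating unknown artists as 0 artworks."""
    n_artworks = freq.get(artist, 0)  # 0 if the artist is not a key
    template = "There are {num} artworks by {name} in the data set"
    return template.format(name=artist, num=n_artworks)


print(artist_summary_safe("Henri Matisse", artist_freq_sample))
print(artist_summary_safe("Not In Data Set", artist_freq_sample))
```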


### Formatting Numbers Inside Strings

We specify number formatting, including things like precision, by adding one of various format specifications inside the braces ({}) of our string.

- To indicate a precision of two decimal places, we specify :.2f after the name or position of our argument.
- Another useful format specification is a comma as a thousands separator, which makes large numbers easier to read.
    - To add a comma, use the syntax ':,' inside the braces, after the number or name of the variable you're inserting.

Note that there is a specific order required. If we don't follow this order, our code will raise a ValueError:

- The name or position of the variable
- A colon (:) to start the format spec
- The thousands separator
- The precision

In [9]:
pop_millions = [
    ["China", 1379.302771],
    ["India", 1281.935991],
    ["USA",  326.625791],
    ["Indonesia",  260.580739],
    ["Brazil",  207.353391],
]

output = "The population of {0} is {1:,.2f} million"

for row in pop_millions:
    country = row[0]
    population = row[1]
    print(output.format(country, population))

The population of China is 1,379.30 million
The population of India is 1,281.94 million
The population of USA is 326.63 million
The population of Indonesia is 260.58 million
The population of Brazil is 207.35 million
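
To see why the order of the format specification matters, the sketch below formats one of the values above with the thousands separator before the precision, then tries the reverse order and catches the resulting error. The variable name `wrong_order_error` is introduced only for this illustration.

```python
population = 1379.302771

# Correct order: colon, thousands separator, then precision
print("{:,.2f}".format(population))

# Reversed order: precision before the separator is rejected
try:
    "{:.2f,}".format(population)
    wrong_order_error = None
except ValueError as error:
    wrong_order_error = error

print(type(wrong_order_error).__name__)
```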


## CHALLENGE: Summarizing Artwork Gender Data

The final exercise for this mission will combine two techniques in order to analyze and display information about the frequencies of artwork by artists of different genders. The two techniques we'll combine are:

1. Creating a frequency table of the genders in the data set, which we have done for both artist ages and artists themselves.
2. Using the str.format() method and its format specification to display the data in an easy-to-read format, which we've done on both of the previous two screens.

One technique we haven't encountered in the previous two missions is looping over a dictionary, which you'll need in order to display the data in the frequency table that you make.

We use the dict.items() method, which gives us the key-value pairs from our dictionary one at a time as we loop. This helps us loop over dictionaries more easily.

In [12]:
gender_ft = {}

output = 'There are {1:,} artworks by {0} artists'

for row in moma_data:
    gender = row[5] # Index is 5 for Gender
    if gender not in gender_ft:
        gender_ft[gender] = 1
    else:
        gender_ft[gender] += 1

for key, value in gender_ft.items():
    print(output.format(key, value))

There are 104,720 artworks by Male artists
There are 8,398 artworks by Gender Unknown/Other artists
There are 17,742 artworks by Female artists
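
The output above appears in the order the genders were first encountered in the data. If a ranked view is preferred, one option is to sort the key-value pairs by count before printing. The sketch below re-enters the three counts from the output above into a small dictionary so that it runs on its own; the names `gender_counts`, `template` and `sorted_pairs` are made up for this example.

```python
# Counts copied from the output above so the example is self-contained
gender_counts = {
    "Male": 104720,
    "Gender Unknown/Other": 8398,
    "Female": 17742,
}

template = "There are {1:,} artworks by {0} artists"

# Sort the (gender, count) pairs by count, largest first
sorted_pairs = sorted(gender_counts.items(), key=lambda pair: pair[1], reverse=True)

for gender, count in sorted_pairs:
    print(template.format(gender, count))
```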
