<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

## Lab: Cleaning Rock Song Data

_Authors: Dave Yerrington (SF)_

---


In [25]:
import pandas as pd
import numpy as np 
import seaborn as sns

%matplotlib inline

### 1. Load `rock.csv` and do an initial examination of its data columns.

In [26]:
rockfile = "./datasets/rock.csv"

In [27]:
pwd

'/Users/mattmates/Desktop/general_assembly/homework'

In [28]:
# Load the data.
rock = pd.read_csv('rock.csv')

In [29]:
# Look at the information regarding its columns.
rock.head(15)

Unnamed: 0,Song Clean,ARTIST CLEAN,Release Year,COMBINED,First?,Year?,PlayCount,F*G
0,Caught Up in You,.38 Special,1982.0,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981.0,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980.0,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975.0,Art For Arts Sake by 10cc,1,1,1,1
5,Kryptonite,3 Doors Down,2000.0,Kryptonite by 3 Doors Down,1,1,13,13
6,Loser,3 Doors Down,2000.0,Loser by 3 Doors Down,1,1,1,1
7,When I'm Gone,3 Doors Down,2002.0,When I'm Gone by 3 Doors Down,1,1,6,6
8,What's Up?,4 Non Blondes,1992.0,What's Up? by 4 Non Blondes,1,1,3,3
9,Take On Me,a-ha,1985.0,Take On Me by a-ha,1,1,1,1


### 2.  Clean up the column names.

Let's clean up the column names. There are two ways we can accomplish this:

#### 2.A Change the column names when you import the data using `pd.read_csv()`.

Notice that when you pass `names=[...]` (a list of strings) whose length matches the number of columns in the file, those strings replace the column names.

NOTE: When you create custom column names, the first row of the `.csv` already represents a header. It is important to tell `pandas` to skip that row. The `skiprows=1` keyword argument to `read_csv()` will tell `pandas` to skip the first row.
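To see why `skiprows=1` matters, here is a minimal sketch. It assumes a made-up two-song CSV held in memory (`csv_text` is a stand-in, not the real `rock.csv`) and compares loading it with and without skipping the original header row.

```python
import io
import pandas as pd

# Hypothetical stand-in for rock.csv: one header line, then two songs.
csv_text = "Song Clean,ARTIST CLEAN\nKryptonite,3 Doors Down\nLoser,3 Doors Down\n"

# Without skiprows, the old header line is read as if it were a song.
with_header_row = pd.read_csv(io.StringIO(csv_text), names=["song", "artist"])

# With skiprows=1, the old header line is discarded.
clean = pd.read_csv(io.StringIO(csv_text), names=["song", "artist"], skiprows=1)

print(with_header_row.shape)  # (3, 2)
print(clean.shape)            # (2, 2)
```

In the first frame, row 0 holds the strings `Song Clean` and `ARTIST CLEAN`, which is exactly the stray row that `skiprows=1` removes.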

In [30]:
# Change the column names when loading the '.csv':
rock = pd.read_csv('rock.csv',
                  names = ['song','artist','release_year','song_artist','first','year','plays','year_x_plays'],
                  skiprows=1)

rock.columns

Index(['song', 'artist', 'release_year', 'song_artist', 'first', 'year',
       'plays', 'year_x_plays'],
      dtype='object')

#### 2.B Change column names using the `.rename()` function.

The `.rename()` function takes an argument, `columns=name_dict`, in which `name_dict` is a dictionary containing the original column names as keys and the new column names as values.
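As a small illustration (using a toy one-row frame rather than the full data set), the sketch below shows two properties of `.rename()` that are easy to miss: it returns a new DataFrame instead of changing the original, and it only touches the columns named in the dictionary.

```python
import pandas as pd

# Toy frame using two of the original column names.
df = pd.DataFrame({"Song Clean": ["Tush"], "PlayCount": [109]})

# Only "Song Clean" is in the dictionary, so "PlayCount" keeps its name.
renamed = df.rename(columns={"Song Clean": "song"})

print(list(df.columns))       # ['Song Clean', 'PlayCount']
print(list(renamed.columns))  # ['song', 'PlayCount']
```

To keep the new names, assign the result back to a variable, for example `df = df.rename(columns=...)`.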

In [31]:
# Change the column names using the `.rename()` function.
col_names = {'Song Clean':'song'
             ,'ARTIST CLEAN':'artist'
             ,'Release Year':'release_year'
             ,'COMBINED':'song_artist'
             ,'First?':'first'
             ,'Year?':'year'
             ,'PlayCount':'plays'
             ,'F*G':'year_x_plays'}

rock.rename(columns=col_names)

Unnamed: 0,song,artist,release_year,song_artist,first,year,plays,year_x_plays
0,Caught Up in You,.38 Special,1982,Caught Up in You by .38 Special,1,1,82,82
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
2,Hold On Loosely,.38 Special,1981,Hold On Loosely by .38 Special,1,1,85,85
3,Rockin' Into the Night,.38 Special,1980,Rockin' Into the Night by .38 Special,1,1,18,18
4,Art For Arts Sake,10cc,1975,Art For Arts Sake by 10cc,1,1,1,1
...,...,...,...,...,...,...,...,...
2225,She Loves My Automobile,ZZ Top,,She Loves My Automobile by ZZ Top,1,0,1,0
2226,Tube Snake Boogie,ZZ Top,1981,Tube Snake Boogie by ZZ Top,1,1,32,32
2227,Tush,ZZ Top,1975,Tush by ZZ Top,1,1,109,109
2228,TV Dinners,ZZ Top,1983,TV Dinners by ZZ Top,1,1,1,1


#### 2.C Reassigning the `.columns` attribute of a DataFrame.

You can also just reassign the `.columns` attribute to a list of strings containing the new column names. 

The only caveat with reassigning `.columns` is that you have to reassign all of the column names at once. You can't partially replace a value by working on `.columns` directly. You have to reassign `.columns` with a list of equal length. 
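The equal-length requirement can be checked on a toy frame. This is only a sketch with made-up data; it shows that a list of the right length is accepted and a shorter list raises a `ValueError`.

```python
import pandas as pd

# Toy frame with two columns.
df = pd.DataFrame({"Song Clean": ["Tush"], "PlayCount": [109]})

# A list with one name per column is accepted.
df.columns = ["song", "plays"]

# A list of the wrong length is rejected with a ValueError.
try:
    df.columns = ["song"]
    mismatch_error = None
except ValueError as err:
    mismatch_error = err

print(list(df.columns))
print(type(mismatch_error).__name__)
```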

In [32]:
# Replace the column names by reassigning the `.columns` attribute.
rock.columns = ['song','artist','release_year','song_artist','first','year','plays','year_x_plays']

rock.columns

Index(['song', 'artist', 'release_year', 'song_artist', 'first', 'year',
       'plays', 'year_x_plays'],
      dtype='object')

### 3. Subsetting data where null values exist.

We have mixed `str` and `NaN` values in the `release` column (named `release_year` in this notebook). `NaN` stands for "not a number" and is the way `pandas` represents null or missing data. We can use the `.isnull()` method of a Series to find null values.

Print the head of the subset of the data where the `release` column is null.
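The masking idea can be sketched on a three-row toy frame (the values are copied from the first rows of the data, but the frame itself is only an illustration). `.isnull()` returns a boolean Series, and indexing the DataFrame with it keeps only the rows where the mask is `True`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "song": ["Caught Up in You", "Fantasy Girl", "Hold On Loosely"],
    "release_year": ["1982", np.nan, "1981"],
})

# Boolean Series: True where release_year is missing.
null_mask = df["release_year"].isnull()

# Keep only the rows where the mask is True.
missing = df[null_mask]

print(null_mask.tolist())        # [False, True, False]
print(missing["song"].tolist())  # ['Fantasy Girl']
```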

In [33]:
# Count the nulls in each column.
rock.isnull().sum()

# Show records where rock['release_year'] is null.
rock[rock['release_year'].isnull()]

Unnamed: 0,song,artist,release_year,song_artist,first,year,plays,year_x_plays
1,Fantasy Girl,.38 Special,,Fantasy Girl by .38 Special,1,0,3,0
10,"Baby, Please Don't Go",AC/DC,,"Baby, Please Don't Go by AC/DC",1,0,1,0
13,CAN'T STOP ROCK'N'ROLL,AC/DC,,CAN'T STOP ROCK'N'ROLL by AC/DC,1,0,5,0
16,Girls Got Rhythm,AC/DC,,Girls Got Rhythm by AC/DC,1,0,24,0
24,Let's Get It Up,AC/DC,,Let's Get It Up by AC/DC,1,0,4,0
...,...,...,...,...,...,...,...,...
2216,"I'm Bad, I'm Nationwide",ZZ Top,,"I'm Bad, I'm Nationwide by ZZ Top",1,0,10,0
2218,Just Got Paid,ZZ Top,,Just Got Paid by ZZ Top,1,0,2,0
2221,My Head's In Mississippi,ZZ Top,,My Head's In Mississippi by ZZ Top,1,0,1,0
2222,Party On The Patio,ZZ Top,,Party On The Patio by ZZ Top,1,0,14,0


### 4. Update slices of your DataFrame based on mask selection/slices.

In many scenarios, we want to update values in our DataFrame according to some criteria. Let's say we want to set all of the null values in `release` to 0.

With newer versions of `pandas`, in order to change data in the original DataFrame, we have to use `.loc` with a row mask and a column label when performing the assignment.

For example, the following won't always work, because `df[row_mask]` may return a copy and the assignment is then lost:
```python
df[row_mask]['column_name'] = new_value
```

The best way to accomplish the same task is:
```python
df.loc[row_mask, 'column_name'] = new_value
```

For multiple column assignment, you would use:
```python
df.loc[row_mask, ['col_1', 'col_2', 'col_3']] = new_value
```
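Here is a minimal, runnable sketch of both `.loc` patterns on a toy frame. The data and the `-1` placeholder are made up for illustration; only the `.loc[row_mask, columns] = value` shape matters.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "song": ["Fantasy Girl", "Tush"],
    "release_year": [np.nan, 1975.0],
    "year": [0, 1],
})

# Rows where release_year is missing.
row_mask = df["release_year"].isnull()

# Multiple columns at once, done on a copy so the two results can be compared.
flagged = df.copy()
flagged.loc[row_mask, ["release_year", "year"]] = -1

# Single column: set missing release years to 0 in the original frame.
df.loc[row_mask, "release_year"] = 0

print(df["release_year"].tolist())       # [0.0, 1975.0]
print(flagged["release_year"].tolist())  # [-1.0, 1975.0]
print(flagged["year"].tolist())          # [-1, 1]
```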

#### 4.A Let's try it out. Make all of the null values in `release` 0.

In [34]:
# Replace release nulls with 0
rock.loc[rock['release_year'].isnull(),'release_year'] = 0

# Check that the number of rows where 'release_year' == 0 is 577.
rock[rock['release_year'] == 0].count()

song            577
artist          577
release_year    577
song_artist     577
first           577
year            577
plays           577
year_x_plays    577
dtype: int64

#### 4.B Verify that `release` contains no null values.

In [35]:
# A:
rock[rock['release_year'].isnull() == True]

Unnamed: 0,song,artist,release_year,song_artist,first,year,plays,year_x_plays


### 5. Ensure that the data types of the columns make sense. 

Verifying column data types is a critical part of data munging. If columns have the wrong data type, then there is usually corrupted or incorrect data in some of the observations.
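A quick sketch of why a wrong data type is a warning sign: a single stray string is enough to turn an otherwise numeric column into the generic `object` dtype. The two Series below are toy examples, not the real column.

```python
import pandas as pd

# All integers: pandas infers an integer dtype.
clean = pd.Series([1982, 1981, 1980])

# One stray string: pandas falls back to the generic object dtype.
corrupted = pd.Series([1982, "SONGFACTS.COM", 1980])

print(clean.dtype)
print(corrupted.dtype)
```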

#### 5.A Look at the data types for the columns. Are any incorrect given what the data represents?

In [36]:
# A:
rock.dtypes

# release_year should be an integer

song            object
artist          object
release_year    object
song_artist     object
first            int64
year             int64
plays            int64
year_x_plays     int64
dtype: object

### 6. Investigate and clean up the `release` column.

The `release` column is a string data type when it should be an integer.

#### 6.A Figure out what value(s) are causing the `release` column to be encoded as a string instead of an integer.

In [37]:
# A: 
rock['release_year'].describe()
rock['release_year'].unique()
rock.groupby('release_year')['release_year'].count()

#one record of 'SONGFACTS.COM'

release_year
0                577
1071               1
1955               1
1958               1
1961               1
1962               3
1963               9
1964              14
1965              28
1966              30
1967              61
1968              46
1969              72
1970              81
1971              75
1972              50
1973             104
1974              48
1975              83
1976              56
1977              83
1978              64
1979              63
1980              70
1981              61
1982              54
1983              60
1984              51
1985              39
1986              37
1987              39
1988              29
1989              32
1990              22
1991              34
1992              14
1993              19
1994              25
1995              10
1996               9
1997               9
1998               6
1999              13
2000               3
2001               4
2002               6
2003               3


#### 6.B Look at the rows in which there is incorrect data in the `release` column.

In [38]:
# A:
rock[rock['release_year'] == 'SONGFACTS.COM']

Unnamed: 0,song,artist,release_year,song_artist,first,year,plays,year_x_plays
1504,Bullfrog Blues,Rory Gallagher,SONGFACTS.COM,Bullfrog Blues by Rory Gallagher,1,1,1,1


#### 6.C Clean up the data.

Normally we would replace the offending data with null (`np.nan`) values. However, we previously converted all of the `NaN` values in the `release` column to zeros, so we might as well continue with the same practice. Replacing with 0 (or `NaN`) will allow us to convert the column to numeric.
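As an aside, and not the approach this lab takes, `pd.to_numeric` can do the null replacement and the conversion in one step when called with `errors="coerce"`. The sketch below uses a toy Series; any value that cannot be parsed as a number becomes `NaN`.

```python
import pandas as pd

# Toy column with one unparseable value.
years = pd.Series(["1982", "SONGFACTS.COM", "1975"])

# errors="coerce" turns values that cannot be parsed into NaN.
as_numbers = pd.to_numeric(years, errors="coerce")

print(as_numbers.isnull().tolist())  # [False, True, False]
print(as_numbers.dtype)              # float64
```

The result is a float column because `NaN` is a floating-point value.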

In [39]:
# A: 
rock['release_year'] = rock['release_year'].replace('SONGFACTS.COM',0)
rock['release_year'] = pd.to_numeric(rock['release_year'])

#check that SONGFACTS.COM has been replaced
rock.groupby('release_year')['release_year'].count()

release_year
0       578
1071      1
1955      1
1958      1
1961      1
1962      3
1963      9
1964     14
1965     28
1966     30
1967     61
1968     46
1969     72
1970     81
1971     75
1972     50
1973    104
1974     48
1975     83
1976     56
1977     83
1978     64
1979     63
1980     70
1981     61
1982     54
1983     60
1984     51
1985     39
1986     37
1987     39
1988     29
1989     32
1990     22
1991     34
1992     14
1993     19
1994     25
1995     10
1996      9
1997      9
1998      6
1999     13
2000      3
2001      4
2002      6
2003      3
2004      5
2005      5
2006      1
2007      3
2008      3
2011      3
2012      5
2013      2
2014      2
Name: release_year, dtype: int64

### 7. Get summary statistics for the `release` column using the `.describe()` function.

Now that the `release` column is finally a numeric data type, we can apply the `.describe()` function.  

#### 7.A Print out the summary stats for the `release` column. What is the earliest and latest release date?

In [40]:
# A:
rock['release_year'].describe()

#rock.groupby('release_year')['release_year'].count()

count    2230.000000
mean     1465.331390
std       867.196161
min         0.000000
25%         0.000000
50%      1973.000000
75%      1981.000000
max      2014.000000
Name: release_year, dtype: float64

In [41]:
rock.dtypes

song            object
artist          object
release_year     int64
song_artist     object
first            int64
year             int64
plays            int64
year_x_plays     int64
dtype: object

#### 7.B Based on the summary statistics, is there anything else wrong with the `release` column? 

In [42]:
# A: Replacing all nulls with 0s makes the summary statistics meaningless. We probably should have kept them as nulls or excluded them from the summary stats.


_Looking at the DataFrame that contains the year 1071, we can see that the year was probably corrupted and should be replaced with something else if possible._
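To illustrate how placeholder zeros distort the statistics, here is a small sketch on made-up numbers. `pandas` skips `NaN` values when computing `mean()` and `min()`, but it treats 0 as a real year.

```python
import numpy as np
import pandas as pd

# Toy column: two placeholder zeros and two real years.
with_zeros = pd.Series([0, 0, 1973, 1981])

# Same column with the placeholders turned back into nulls.
with_nulls = with_zeros.replace(0, np.nan)

print(with_zeros.mean())  # 988.5
print(with_nulls.mean())  # 1977.0
print(with_nulls.min())   # 1973.0
```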

### 8. Make changes and investigate using custom functions with `.apply()`.

Let's say we want to traverse every single row in our data set and apply a function to that row.

#### 8.A Write a function that will take a row of a DataFrame and print out the song, artist, and whether or not the release date is < 1970.


In [43]:
rock.columns

Index(['song', 'artist', 'release_year', 'song_artist', 'first', 'year',
       'plays', 'year_x_plays'],
      dtype='object')

In [None]:
# A lambda needs a body, and axis=1 passes each row to it.
rock.apply(lambda row: row['release_year'] < 1970, axis=1)

In [44]:
# A:
def meta_data(row):
    """Return a summary string for one row of the `rock` DataFrame."""
    # Convert the boolean to text so it can be joined to the other strings.
    is_before_1970 = str(row['release_year'] < 1970)
    return ('SONG:' + row['song'] + ', ARTIST:' + row['artist']
            + ', IS_BEFORE_1970:' + is_before_1970)

In [45]:
meta_data(rock.iloc[0])

'SONG:Caught Up in You, ARTIST:.38 Special, IS_BEFORE_1970:False'

In [46]:
rock['release_year'].iloc[0:15] <1970

0     False
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10     True
11    False
12    False
13     True
14    False
Name: release_year, dtype: bool

#### 8.B Using the `.apply()` function, apply the function you wrote to the first four rows of the DataFrame.

You will need to tell the `apply` function to operate row by row. Setting the keyword argument as `axis=1` indicates that the function should be applied to each row individually.

In [61]:
# Apply the function to the first four rows only; axis=1 passes each row to it.
rock.head(4).apply(meta_data, axis=1)

If your function prints its results instead of returning them, you'll notice a final output Series of `None` values. When the applied function does not return a value, `.apply()` returns a Series of `None` values (just as a Python function returns `None` by default when it has no return statement).
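The difference between printing and returning can be sketched on a toy two-row frame. The function names `print_only` and `returns_flag` are made up for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "song": ["Caught Up in You", "Art For Arts Sake"],
    "release_year": [1982, 1975],
})

def print_only(row):
    # Prints, but has no return statement, so it returns None.
    print(row["song"], row["release_year"] < 1970)

def returns_flag(row):
    # Returns a value, so .apply() collects it into the result.
    return row["release_year"] < 1970

printed = df.apply(print_only, axis=1)   # Series of None values
flags = df.apply(returns_flag, axis=1)   # Series of booleans

print(printed.isnull().all())  # True
print(flags.tolist())          # [False, False]
```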

### 9. Write a function that converts cells in a DataFrame to float and otherwise replaces them with `np.nan`.

If applied to our data, it would keep only the numeric information and otherwise input null values.

Recall that the try-except syntax in Python is a great way to try something and take another action if the initial step fails:

```python
try:
    ...  # Perform some action.
except Exception:
    ...  # Perform some other action if the first failed with an error.
```

#### 9.A Write the function that takes a column and converts all of its values to float if possible and `np.nan` otherwise. The return value should be the converted Series.

In [60]:
# A:
def float_conv(col):
    """Convert each value in a column (Series) to float, or np.nan if that fails."""
    def to_float(value):
        try:
            return float(value)
        except (TypeError, ValueError):
            return np.nan
    return col.map(to_float)

In [56]:
float(rock.release_year)

TypeError: cannot convert the series to <class 'float'>

#### 9.B Try your function out on the rock song data and ensure the output is what you expected.


In [55]:
# A:
# Apply the converter to every column; cells that are not numeric become NaN.
rock_float = rock.apply(float_conv)
rock_float.head()

#### 9.C Describe the new float-only DataFrame.

In [52]:
# A: