# Computing the autocorrelation of a time series

Time series analysis has important applications in many areas, including finance, electrical engineering, biology, and climatology. An important property of a time series is its autocorrelation function, which quantifies the average similarity between a signal and a shifted version of the same signal, as a function of the delay between the two. In other words, the autocorrelation gives us information about repeating patterns as well as the timescale of the signal's fluctuations: the faster the autocorrelation decays to zero, the faster the signal varies.
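
As a rough illustration of the link between decay and timescale, the sketch below compares two synthetic sine waves. The helper `first_lag_below`, the 0.5 threshold, and the test signals are assumptions made for this example only; they are not part of the baby names analysis.

```python
import numpy as np

def first_lag_below(x, threshold=0.5):
    """Smallest lag at which the autocorrelation of x, normalized
    to 1 at lag 0, falls below `threshold` (hypothetical helper)."""
    # 'full' mode returns lags -(N-1)..(N-1); lag 0 sits at index N-1
    r = np.correlate(x, x, mode="full")[x.size - 1:]
    r = r / r[0]
    # argmax on a boolean array gives the index of the first True
    return int(np.argmax(r < threshold))

n = np.arange(1000)
slow = np.sin(2 * np.pi * n / 100)  # period of 100 samples
fast = np.sin(2 * np.pi * n / 10)   # period of 10 samples

slow_lag = first_lag_below(slow)
fast_lag = first_lag_below(fast)
print(slow_lag, fast_lag)
```

The faster signal should cross the threshold at a smaller lag than the slower one.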

There are various definitions of the autocorrelation function depending on the domain. Here, we define the autocorrelation of a time series as:

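$$ R(k) = \frac{1}{N} \sum_{n} x_{n} x_{n+k} $$

where $x_n$ is the value of the series at time step $n$, $N$ is the number of samples, and $k$ is the delay (lag).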
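
The definition can be evaluated directly as a sanity check. This is a minimal sketch, assuming that samples beyond the end of the series count as zero (the formula leaves the summation limits open); the name `autocorrelation_direct` is hypothetical. It compares the direct sum with `numpy.correlate` in `'full'` mode, divided by $N$.

```python
import numpy as np

def autocorrelation_direct(x):
    """Evaluate R(k) = (1/N) * sum_n x[n] * x[n+k] for k = 0..N-1,
    treating samples beyond the end of x as zero (an assumption)."""
    x = np.asarray(x, dtype=float)
    N = x.size
    # For lag k, only the first N-k samples have a partner x[n+k]
    return np.array([np.dot(x[:N - k], x[k:]) for k in range(N)]) / N

x = np.array([1.0, 2.0, 3.0, 4.0])
direct = autocorrelation_direct(x)

# Same quantity via numpy.correlate: keep lags 0..N-1, then divide by N
via_numpy = np.correlate(x, x, mode="full")[x.size - 1:] / x.size

print(direct)     # [7.5  5.   2.75 1.  ]
print(np.allclose(direct, via_numpy))  # True
```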


To illustrate how to compute the autocorrelation function, we use the names of babies born in the United States. Application records from the Social Security Administration are a great way to track trends in baby names over time. Data.gov releases two datasets that include the number of babies with a given name and gender born in a given year. Note that, to protect privacy, the dataset only includes names given to at least 5 babies in the same year per state.

The `data` folder in the code repository contains a `Baby-names.zip` archive, which holds the national data downloaded from the [Social Security Administration's Baby Names portal](https://www.ssa.gov/oact/babynames/limits.html).

In [None]:
import os
import zipfile
import numpy
import pandas

# Set up the path to the data folder
data_folder = os.path.abspath('./data/')

# Path to the zip archive that holds the data files
zip_file = os.path.join(data_folder, 'Baby-names.zip')

def list_files_from_zip(zf):
    """List files in a zip archive
    
    Parameters
    ----------
    zf : string or path
        Zip archive location
    """
    with zipfile.ZipFile(zf) as z:
        files = z.namelist()
    return list(files)

def read_csv_from_zip(zf, fn, **kwargs):
    """Read a CSV file from a zip archive
    
    Parameters
    ----------
    zf : str or path-like
        Zip archive location
        
    fn : str
        String name for data file
        
    Returns
    -------
    pandas.DataFrame
        Contents of CSV read into a DataFrame
    """
    with zipfile.ZipFile(zf) as z:
        matching = [s for s in z.namelist() if fn in s]
        if not matching:
            raise FileNotFoundError('File {} not in zip archive'.format(fn))
        else:
            zipped_data_file = matching[0]
            # Read the first matching CSV file into a DataFrame
            return pandas.read_csv(z.open(zipped_data_file), **kwargs)

print(data_folder)
print(zip_file)

In [None]:
# Get a list of file names for analysis
files = [file for file in list_files_from_zip(zip_file) if file.startswith('yob')]

# Extract the years from file names of the form 'yobYYYY...'
years = sorted(int(file[3:7]) for file in files)

# Read each file into a pandas.DataFrame, stored in a dictionary
# keyed by year. The year is parsed from each file's own name, so
# the mapping does not depend on the order of `files`.
data = {
    int(f[3:7]): read_csv_from_zip(
        zip_file, f, index_col=0, header=None,
        names=['first_name', 'gender', 'number'])
    for f in files
}

# Each file is read into a pandas.DataFrame
print(type(data[1996]))

In [None]:
# pandas.DataFrame is indexed by the names column in each file
data[2016].head()

In [None]:
def get_value(name, gender, year):
    """Return the number of babies born in a given year,
    with a given gender and a given name.
    
    Parameters
    ----------
    name: string
        Baby name
        
    gender: string
        Gender of the baby. M for male and F for female
        
    year: int
        Year of birth as integer
    
    Returns
    -------
    value: int
        Number of babies born with given name
    """
    dy = data[year]
    try:
        return dy[dy['gender'] == gender]['number'][name]
    except KeyError:
        return 0
    
print(get_value('Sophie','F', 2000))

In [None]:
def get_evolution(name, gender):
    """Evolution of a baby name over the years
    
    Parameters
    ----------
    name: string
        Baby name
        
    gender: string
        Gender of the baby, either 'M' or 'F'
    
    Returns
    -------
    numpy.ndarray
        Array of values representing the number with given
        name and gender through the entire dataset
    """
    return numpy.array([get_value(name, gender, int(year)) for year in years])

In [None]:
def autocorrelation(x):
    """Compute the autocorrelation of an array

    This function uses NumPy's correlate function in
    'full' mode, which returns the correlation at every
    lag, negative lags included. Only the second half,
    starting at lag 0, is returned. The result is not
    divided by N. Refer to the numpy.correlate
    documentation for more information.

    Parameters
    ----------
    x : numpy.ndarray, shape (N,)
        1-D array of values

    Returns
    -------
    numpy.ndarray, shape (N,)
        Autocorrelation of x at lags 0 to N - 1
    """
    result = numpy.correlate(x, x, mode='full')
    return result[result.size // 2:]

In [None]:
def plot_autocorrelation_name(name, gender, color, axes=None):
    """Compute and plot the autocorrelation given name and gender
    
    Parameters
    ----------
    name : string
        Baby name
        
    gender : string
        Gender of the baby. Either 'M' or 'F'
        
    color : string
        Valid matplotlib single character color name for 
        the plot
        
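    axes : sequence of two matplotlib Axes, optional
        Axes for plotting the signal and the autocorrelation
        function, respectively. A new figure with two axes is
        created if omitted.
    """
    if axes is None:
        # Imported here so the function does not rely on a later cell
        import matplotlib.pyplot as plt
        _, axes = plt.subplots(1, 2, figsize=(12, 4))

    x = get_evolution(name, gender)
    z = autocorrelation(x)
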
    # Evolution of the name.
    axes[0].plot(years, x, '-' + color, label=name)
    axes[0].set_title("Baby names")
    axes[0].legend()

    # Autocorrelation, scaled by its maximum (the value at lag 0).
    axes[1].plot(z / float(z.max()), '-' + color, label=name)
    axes[1].legend()
    axes[1].set_title("Autocorrelation")

In [None]:
# Compare the autocorrelation of some female names
%matplotlib inline

import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_autocorrelation_name('Jennifer', 'F', 'r', axes=axes)
plot_autocorrelation_name('Maria', 'F', 'g', axes=axes)
plot_autocorrelation_name('Anna', 'F', 'b', axes=axes)

In [None]:
# Gender-neutral name examples
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_autocorrelation_name('Skyler', 'F', 'g', axes=axes)
plot_autocorrelation_name('Skyler', 'M', 'r', axes=axes)