This kernel provides a quick analysis of the CDC birth dataset to check data consistency, plus an additional visualisation of birth variation per weekday.

## Load data

In [None]:
from __future__ import print_function  # must come before any other import
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()
df = pd.read_csv('../input/births.csv')

df.head()

## Cleaning

In [None]:
df.describe()

=> Column `day` has abnormal values

In [None]:
df.day.unique()

### Remove days with abnormal values

In [None]:
df = df[(df.day>=1) & (df.day<=31)]


In [None]:
ndays = df[['year', 'day']].groupby('year').count()
plt.bar(ndays.index, ndays.day)
plt.title("Number of days available per year (both genders)")
lines = plt.plot(plt.gca().get_xlim(), 365*2*np.ones(2), 'g-')

=> Looks good, except for a few excess days from 1969 to 1978
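
The green reference line assumes 365 days for every year. As a rough sketch of a slightly tighter check, the expected row count can account for leap years with the standard library's `calendar.isleap`. The helper name `expected_rows` is hypothetical (it is not part of the kernel), and it assumes one row per day and per gender.

```python
import calendar

def expected_rows(year, n_genders=2):
    """Expected number of daily rows for one year (hypothetical helper)."""
    # Assumption: the dataset holds exactly one row per day and per gender.
    days = 366 if calendar.isleap(year) else 365
    return days * n_genders

for year in (1969, 1972, 1978):
    print(year, expected_rows(year))
```

Comparing these values with the per-year counts would show how many rows each year has in excess.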

## Check data availability on a per month basis

In [None]:
daycount = df.copy()
daycount['day'] = 1  # every row counts as one day
daycount.pivot_table('day', index=['year', 'gender'], columns=['month'], aggfunc='sum').T

=> Looks like February has gained some extra days from 1969 to 1978, which explains the excess seen previously
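
Extra February days suggest rows with calendar dates that cannot exist, such as February 30. A minimal sketch of how such a date can be detected, using only the standard library; `is_valid_date` is a hypothetical helper introduced for illustration, not something the kernel defines.

```python
import datetime

def is_valid_date(year, month, day):
    """Return True if (year, month, day) is a real calendar date (hypothetical helper)."""
    try:
        # datetime.date raises ValueError for impossible dates
        datetime.date(year, month, day)
        return True
    except ValueError:
        return False

print(is_valid_date(1970, 2, 30))  # February 30 never exists
print(is_valid_date(1972, 2, 29))  # 1972 is a leap year
```

Applied row by row to the `year`, `month` and `day` columns, a check like this would flag the surplus February entries.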

## Check consistency of birth values

In [None]:
# distplot is deprecated in recent seaborn versions; histplot with a KDE overlay replaces it
ax = sns.histplot(df.births, kde=True)

## Check months with abnormal birth values

In [None]:
baddays = df[df.births < 1000].copy()[['year', 'month', 'day']]
baddays.groupby(['year', 'month']).count().T



## Remove abnormal birth values

In [None]:
df = df[df.births >= 1000]  # complement of the `< 1000` filter used above

## Visualisation of births

In [None]:
def showbirth():
    fig = plt.figure(figsize=(15,4))
    males = df.gender == 'M'
    females = df.gender == 'F'
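    # approximate decimal year; month and day are shifted so that January 1st maps to the start of the year
    plt.plot(df.year[males]+(df.month[males]-1)/12.+(df.day[males]-1)/365., 
             df.births[males], 
             '+', label='Males')
    plt.plot(df.year[females]+(df.month[females]-1)/12.+(df.day[females]-1)/365., 
             df.births[females], 
             'x', label='Females')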
    plt.xlabel('year')
    plt.ylabel('births')
    plt.legend()
    return fig
fig = showbirth()
plt.title('Number of births per day between 1969 and 1989')
limits = plt.xlim(1969, 1989)

## Zoom on one year

In [None]:
import datetime
fig = showbirth()
plt.title('All days of 1970')
plt.xlim(1970, 1971)
xticks = plt.xticks(1970 + np.arange(12)/12.)
months = [datetime.date(2000, month, 1).strftime('%B') for month in range(1, 13)]
labels = fig.get_axes()[0].set_xticklabels(months)

There is a clear seasonal variation, which is a known effect.

A weekly variation is responsible for the bimodal effect visible on the 1969-1989 view.
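
One way to check that the two bands correspond to workdays and weekends is to label each date by its day type. A minimal sketch using the standard library; `day_type` is a hypothetical helper, not part of the kernel.

```python
import datetime

def day_type(year, month, day):
    """Return 'weekend' for Saturday or Sunday, else 'weekday' (hypothetical helper)."""
    # date.weekday() numbers the days from Monday = 0 to Sunday = 6
    if datetime.date(year, month, day).weekday() >= 5:
        return 'weekend'
    return 'weekday'

print(day_type(1970, 1, 3))  # a Saturday
print(day_type(1970, 1, 5))  # a Monday
```

Colouring the scatter plot by this label should separate the upper and lower bands if the weekly variation is indeed the cause.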

## Add a `dayname` column to the dataset

In [None]:
datetimes = pd.to_datetime(df[['year', 'month', 'day']], errors='coerce')

bad_dates = datetimes.isna()
print("Number of invalid date values found:", bad_dates.sum())


In [None]:
dayname = datetimes.dt.day_name()  # invalid dates (NaT) yield NaN
df.insert(0, 'dayname', dayname)
df.head()


## Births per weekday

In [None]:
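# right=False gives the bins [1960, 1970), [1970, 1980), [1980, 1990), so 1970 falls in the 70's
df['decade'] = pd.cut(df.year, [1960, 1970, 1980, 1990], right=False, labels=["60's", "70's", "80's"])
pivot = pd.pivot_table(df, values='births', index='dayname', columns='decade', aggfunc='mean')
# order the rows from Monday to Sunday instead of alphabetically
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
pivot = pivot.reindex(weekdays)
plt.rcParams['figure.figsize'] = (10,5)
ax = pivot.plot()

# for an unknown reason, day names do not appear on this plot, so set the ticks explicitly
ax.set_xticks(np.arange(7))
ax.set_xticklabels(pivot.index)
pivot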

## Observations

On Saturday and Sunday there is a significant drop in births. Hopefully this effect is not due to pregnant women having to wait until Monday ;) but rather to scheduled medical procedures such as cesarean sections being performed during the week. Hence the number of births is *higher* on workdays.

>    Where are the Sunday babies  
>    https://www.ncbi.nlm.nih.gov/pubmed/17891531
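
To put a number on the weekend drop, the mean weekend count can be compared with the mean workday count. The sketch below uses only the standard library and a handful of toy values invented for illustration (they are not taken from the CDC file); `weekend_drop` is a hypothetical helper.

```python
import datetime
from statistics import mean

# Toy (date, births) pairs, made up for illustration only
toy = [
    (datetime.date(1970, 1, 5), 5000),   # Monday
    (datetime.date(1970, 1, 6), 5200),   # Tuesday
    (datetime.date(1970, 1, 10), 4200),  # Saturday
    (datetime.date(1970, 1, 11), 4000),  # Sunday
]

def weekend_drop(records):
    """Relative drop of mean weekend births versus mean workday births (hypothetical helper)."""
    workday = mean(births for date, births in records if date.weekday() < 5)
    weekend = mean(births for date, births in records if date.weekday() >= 5)
    return 1 - weekend / workday

print(round(weekend_drop(toy), 3))
```

On the real data, the same ratio could be computed per decade to see whether the gap between workdays and weekends changes over time.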
