**Hi everyone! Today I'm going to look at inconsistencies in the NYT data for Alameda County that call the trustworthiness of their data collection into question. I'm also going to compare the data to the graphs the NYT shows online when you look up Covid cases for Alameda County.**

I want to be clear that this is not a reflection of my political ideology, and I'm not doing this to prove that COVID is a hoax or anything like that. I'm doing this because I want people to do their own research on data before believing something, and because I want the NYT to clear up the discrepancies in the data.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
import plotly.offline as py
py.init_notebook_mode(connected = True)
import plotly.graph_objs as go
from plotly.offline import plot, iplot
import plotly.express as px
pd.options.mode.chained_assignment = None

In [None]:
df = pd.read_csv('/kaggle/input/us-counties-covid-19-dataset/us-counties.csv')
df['date'] = pd.to_datetime(df['date'])
df.head()

There are multiple counties and states in this dataset, but for our problem we're just going to focus on Alameda County in California.

In [None]:
# .copy() so the column assignments further down modify this frame, not a view of df
alameda_county = df[(df['state'] == 'California') & (df['county'] == 'Alameda')].copy()
px.bar(data_frame= alameda_county, x = 'date', y = 'cases', color = 'deaths')

Note how the data is always increasing. This is because the data is cumulative, meaning that each day's count adds on to the previous day's total. The deaths are cumulative as well. To turn these running totals into daily counts, we will use the .diff() method in pandas.

Strangely, though, notice how in early June 2021 the deaths all of a sudden go down and then start slowly increasing again. We will come back to this in a bit; for now, let's see the results of converting the data to daily counts.

In [None]:
alameda_county['cases'] = alameda_county['cases'].diff()
alameda_county['deaths'] = alameda_county['deaths'].diff()
px.bar(data_frame= alameda_county, x = 'date', y = 'cases')

So for the most part the daily case counts look fine; however, you may notice that on July 3rd, 2021 there appear to be -85 cases. Let's take a closer look at the data near that day.

In [None]:
px.bar(data_frame= alameda_county[(alameda_county['date'].dt.year == 2021) & (alameda_county['date'].dt.month == 7)], x = 'date', y = 'cases')

In [None]:
alameda_county[alameda_county['cases'] < 0]

In [None]:
df.iloc[[1478979]]

So for some reason July 3rd has -85 cases. On NYT's actual graph it says that on July 3rd there were 0 new cases in Alameda County. This is kind of worrying because a drop in a cumulative total carries over to every total after that day, and I know the graph is built from this same data.

Let's take a look at the deaths as well to see if there are any discrepancies there.

In [None]:
px.bar(data_frame=alameda_county, x = 'date', y = 'deaths')

Weirdly enough, according to the graph there is a data point that reaches about -400 deaths, making the rest of the data look tiny in comparison. I'm going to graph this again without that huge negative number so we can get a better idea of how the graph looks.

In [None]:
px.bar(data_frame=alameda_county[alameda_county['deaths'] > -400], x = 'date', y = 'deaths')

There still seem to be multiple points that are negative, so let's check out all of these points.

In [None]:
alameda_county = alameda_county.reset_index(drop = True)
alameda_county[alameda_county['deaths'] < 0]

Wow, there are a lot of points here with negative deaths, and this is just for Alameda County. Let's compare some of these points to the number of deaths the NYT actually shows for those dates.

In [None]:
alameda_county.iloc[[232]]

On October 19th, 2020 we have -9 deaths.

NYT has the number of deaths as 0.

In [None]:
alameda_county.iloc[[307]]

On January 2nd, 2021 we have -1 deaths.

NYT has the number of deaths as 0.

In [None]:
alameda_county.iloc[[384]]

On March 20th, 2021 we have -10 deaths.

NYT has the number of deaths as 0.

In [None]:
alameda_county.iloc[[460]]

On June 4th, 2021 we have -423 deaths.

NYT has the number of deaths as 0.

In [None]:
alameda_county.iloc[[492]]

On July 6th, 2021 we have -2 deaths.

NYT has the number of deaths as 1.

I'm noticing a fairly common theme: the NYT graph shows 0 as the number of deaths when the daily death count in the data is negative.
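One way a chart could end up showing 0 on those days is by flooring negative daily values at zero before plotting. To be clear, this is only my guess; I don't know how the NYT actually builds its graphs, and it would not explain the 1 shown for July 6th. Here is a minimal sketch of that idea with made-up numbers, using pandas' `clip`:

```python
import pandas as pd

# Made-up daily death counts, including some negative "corrections"
daily_deaths = pd.Series([3, -9, 0, -1, 2])

# Assumption for illustration only: negative days are displayed as 0
displayed = daily_deaths.clip(lower=0)

print(displayed.tolist())  # [3, 0, 0, 0, 2]
```

Note that flooring changes the total: the displayed values above sum to 5, while the raw daily values sum to -5.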

Again, this is fairly worrying because this is real data that the NYT uses, and millions of people see the graphs built from it every day when looking at Covid cases and deaths. Yet just in Alameda County we see multiple discrepancies that really call the validity of the NYT's data into question.

**Just to remind ourselves of the magnitude of this problem: the deaths in the original dataframe are cumulative, which means that every time there is a new Covid death it is added to the previous total. Yet somehow, for multiple dates, something went wrong in the running total, causing the number of deaths to decrease from one day to the next, which should be impossible. And if the deaths decreased from one day to the next at some point, how can we trust any of the data past those negative points, since every later date builds on the totals that contain those discrepancies?**
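The "a cumulative total should never go down" rule is easy to check in code. Here is a minimal sketch on made-up numbers (the dates and counts are hypothetical, not NYT data) that tests whether a series ever decreases and lists the days where it does:

```python
import pandas as pd

# Made-up cumulative death totals with one impossible drop (103 -> 98)
cumulative_deaths = pd.Series(
    [100, 103, 103, 98, 99, 101],
    index=pd.date_range("2021-01-01", periods=6, freq="D"),
)

# True only if every value is >= the one before it
print(cumulative_deaths.is_monotonic_increasing)  # False

# Keep only the days where the total went down
changes = cumulative_deaths.diff()
drops = changes[changes < 0]
print(drops)
```

Here `drops` contains a single entry, -5.0 on 2021-01-04, which is the day the toy total fell from 103 to 98.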

Another worrying thing to note is that for July 6th the NYT lists the number of deaths as 1, instead of putting down 0 as it usually does when the number of deaths is negative. In fact, the whole month of July is just weird when we go back to our first cumulative graph.

In [None]:
alameda_county_2 = df[(df['state'] == 'California') & (df['county'] == 'Alameda')]
px.bar(data_frame= alameda_county_2, x = 'date', y = 'cases', color = 'deaths')

For the most part we can see that the deaths are cumulative, indicated by the color getting lighter, and yet on June 4th, for some reason, the number of deaths decreases from 1687 to 1264 even though the data is cumulative. The totals then just carry on from there, with the following dates having somewhere near 1200 cumulative deaths and no acknowledgment of the sudden decrease.

Again, there is no explanation offered for why there is such a huge change in the number of deaths from June 3rd to June 4th. The cases, meanwhile, still seem to be cumulative (except on July 3rd, where the number of cases suddenly decreases, which is where our difference of -85 cases came from).

P.S. Just to note, the change from June 3rd to June 4th is where the -423 deaths come from: we took the difference June 4th - June 3rd, or 1264 - 1687 = -423.

So at this point I'm very worried and doubtful about the trustworthiness of the NYT data. If the numbers of deaths and cases are this messed up for just Alameda County, then I can't imagine how bad it may be for all the other counties out there.
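I haven't run this check on the other counties, but the same idea extends naturally. Below is a minimal sketch on a toy dataframe (the counties "A" and "B" in state "X" and all the numbers are made up) that uses the same column names as the dataset and counts, per county, how many days the cumulative deaths went down:

```python
import pandas as pd

# Toy data with the same column names as us-counties.csv (values are made up)
toy = pd.DataFrame({
    "date": pd.to_datetime(["2021-06-01", "2021-06-02", "2021-06-03"] * 2),
    "county": ["A"] * 3 + ["B"] * 3,
    "state": ["X"] * 6,
    "deaths": [10, 12, 11, 5, 5, 7],
})

# Sort so each county's rows are in date order, then diff within each county
toy = toy.sort_values(["state", "county", "date"])
toy["daily_deaths"] = toy.groupby(["state", "county"])["deaths"].diff()

# Count the days with a negative daily value for every county
negative_days = (toy["daily_deaths"] < 0).groupby([toy["state"], toy["county"]]).sum()
print(negative_days)
```

In this toy example county A has one negative day (12 to 11) and county B has none. Diffing inside the `groupby` matters: a plain `.diff()` on the whole frame would also subtract the last row of one county from the first row of the next.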

**I've stated this once and I'll state it again: the data for deaths and cases is cumulative, so if there is a discrepancy on one day, it carries over to every day after it, meaning that every day after it is wrong! So when the NYT posts data for Alameda County in which the cumulative deaths somehow keep decreasing with no explanation, the data becomes untrustworthy.**

I hope you've enjoyed reading this and will now go off and question data more often instead of just believing a graph on a website. Have a great day!