# 0. 5th century BCE to 2019

Sources:

* [wikipedia](https://en.wikipedia.org/wiki/Data_science)
* [History of statistics](https://en.wikipedia.org/wiki/History_of_statistics)
* [Google](https://www.google.com/)
* [scimagojr](https://www.scimagojr.com/)
* Web

## Pre-Data Science era

### Early events

* 5th century BCE: first recorded use of a statistical method, taking the most frequent value (the mode, in modern terminology)
* 8th century: forms of probability and statistics were developed by Al-Khalil (717–786 CE), whose Book of Cryptographic Messages contains the first use of permutations and combinations to list all possible Arabic words with and without vowels
* 9th century: Al-Kindi's Arabic book entitled Manuscript on Deciphering Cryptographic Messages
* 14th century: the history of Florence by the Florentine banker and official Giovanni Villani includes much statistical information on population, ordinances, commerce and trade, and education
* The idea of the median originated in Edward Wright's book on navigation (Certaine Errors in Navigation) in 1599
* Christiaan Huygens (1657) gave the earliest known scientific treatment of probability
* Pierre-Simon Laplace (1774) made the first attempt to deduce a rule for the combination of observations from the principles of the theory of probabilities. He represented the law of probability of errors by a curve and deduced a formula for the mean of three observations
* The method of least squares, which was used to minimize errors in data measurement, was published independently by Adrien-Marie Legendre (1805), Robert Adrain (1808), and Carl Friedrich Gauss (1809)


### The birth of statistics is often dated to 1662

* John Graunt, along with William Petty, developed early human statistical and census methods that provided a framework for modern demography. Graunt produced the first life table, giving probabilities of survival to each age. His book Natural and Political Observations Made upon the Bills of Mortality used analysis of the mortality rolls to make the first statistically based estimation of the population of London.

### 18th century

* The term "statistics" appeared in Germany in 1749
* The term is ultimately derived from the New Latin statisticum collegium ("council of state") and the Italian word statista ("statesman" or "politician")
* In the 18th century, the term "statistics" designated the systematic collection of demographic and economic data by states

### 19th century

* The first book to have "statistics" in its title was "Contributions to Vital Statistics" (1845) by Francis G. P. Neison


![](https://i.imgur.com/WbpzSL6.png)

## Data science era

The term "data science" has appeared in various contexts over the past thirty years but did not become an established term until recently.

> Naur later introduced the term "datalogy"

### 2001

* William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data", in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics"

### 2003

* Columbia University began publishing The Journal of Data Science


### 2007

* Turing Award winner Jim Gray envisioned "data-driven science" as a "fourth paradigm" of science that uses the computational analysis of large data as its primary scientific method


### April 2010

* Kaggle is founded

### 2012

* Harvard Business Review article "Data Scientist: The Sexiest Job of the 21st Century"

### 2013

* The IEEE Task Force on Data Science and Advanced Analytics was launched
* The first "European Conference on Data Analysis (ECDA)" was organised in Luxembourg, establishing the European Association for Data Science (EuADS)

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
        

# Any results you write to the current directory are saved as output.

In [None]:
# read data
df_multi_answers = pd.read_csv('/kaggle/input/kaggle-survey-2019/multiple_choice_responses.csv')

In [None]:
# helper method for visualization of value counts
import seaborn as sns
import matplotlib.pyplot as plt

def plot_value_counts(res, size, title, ylabel, xlabel):
    """Bar-plot the `size` most frequent values of a value_counts() result."""
    top_res = res.head(size)

    # Recent seaborn releases require x and y as keyword arguments
    sns.barplot(x=top_res.index, y=top_res.values, alpha=0.8)
    plt.xticks(rotation=30, horizontalalignment='right')
    plt.title(title)
    plt.ylabel(ylabel, fontsize=12)
    plt.xlabel(xlabel, fontsize=12)
    plt.show()

# 1. 2019

# 1.1. Data science in Kaggle per country

Recent discoveries and progress in Data Science have shaken the world. One after another, big countries such as the USA, China, India, and Brazil are investing more effort in new areas of Data Science in search of faster, better, and more precise algorithms. People like Andrew Ng, Yann LeCun, and Yoshua Bengio have influenced thousands in this new sorcery. Their energy, passion, and vision show the path and open new areas for exploration.

Thousands of followers from different countries and of different ages have started this magnificent journey to unveil answers to important questions like:
* How did life begin?
* Are we alone in the universe?
* What is consciousness, intelligence?
* Can we live forever?
* How do we beat bacteria?

Let's have a look at the countries and how they are represented on Kaggle, which is advertised as "Your Home for Data Science":

In [None]:
countries = df_multi_answers.Q3.value_counts()
plot_value_counts(countries, 15, "Top countries Kaggle", 'Scientists', 'Country')

We can notice two interesting points here:
* a dinosaur-tail pattern: the first few countries lead the rest by a huge margin
* the results include some unexpected countries (compared with the [Scimago Journal & Country Rank](https://www.scimagojr.com/countryrank.php?area=1700&category=1706))

Below you can find the top countries by number of papers:

**1996 - 2018**

![](https://i.imgur.com/TYn0vgr.png)

**2018**

![](https://i.imgur.com/NCTp8pX.png)

We can see some discrepancies:
* China, the number one country by papers, has fewer participants on Kaggle than expected
* Brazil is not in the top ten by publications

The surprises for me are India and Brazil, which show desire and future potential to grow in these areas. Now let's check the top countries by age:

# 1.2. Top countries per age groups

Looking at the chart below, I would expect interesting discoveries and names from these countries:
* India
* Brazil
* China
* Nigeria
* Turkey

The rest of the countries show a more balanced age distribution.

Another interesting point that we can find below is the pioneering countries:

* United States of America
* Canada
* Russia

which have more participants in the 60-69 and 70+ groups.
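
Raw counts favour the countries with the most respondents, so age groups are easier to compare as shares within each country. A minimal sketch of that idea, assuming a tiny made-up frame named `toy` (hypothetical, not survey data) with the same `Q3` (country) and `Q1` (age group) columns:

```python
import pandas as pd

# Hypothetical toy responses, invented for illustration only.
# The real notebook would use df_multi_answers with Q3 = country and Q1 = age group.
toy = pd.DataFrame({
    "Q3": ["India", "India", "India", "Canada", "Canada", "Canada"],
    "Q1": ["18-21", "18-21", "22-24", "22-24", "60-69", "70+"],
})

# normalize="index" converts each country's row into shares that sum to 1.
age_share = pd.crosstab(toy.Q3, toy.Q1,
                        rownames=["country"], colnames=["age"],
                        normalize="index")
print(age_share.round(2))
```

With row-normalized shares, a country with few respondents but many older participants stands out just as clearly as a large one.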

In [None]:
# Keep only respondents from the 15 most represented countries
top_countries = df_multi_answers.Q3.value_counts().head(15).index
print(top_countries)
df_top_countries = df_multi_answers[df_multi_answers.Q3.isin(top_countries)]

# Cross-tabulate country (Q3) against age group (Q1)
pd.crosstab(df_top_countries.Q3, df_top_countries.Q1,
            rownames=['country'], colnames=['age'])

# 1.3 Participants vs Kaggle visitors vs World

About 19K people took part in the survey, which is not a bad result compared with, for example, the Stack Overflow survey, which draws about 100,000 respondents. The top web analytics sites estimate that Kaggle.com receives between 50K and 300K visits daily, so the respondents make up a good share of its visitors. Another interesting parallel is a comparison between the scientists of earlier times and today.

From this map: [Global literacy today](https://ourworldindata.org/literacy)

![](https://i.imgur.com/9QqeJ5d.png)

Large parts of the world were illiterate in the past. Nowadays more and more people have the education and knowledge required in the modern world. But I would argue that we are living in the information era, and that people who don't understand data, how information travels, and what benefits information brings are illiterate.

So I see a parallel between lacking Data Science knowledge today and illiteracy in ancient times.

Currently the survey respondents represent a fraction of about 0.0000025 (roughly 0.00025%) of the world population.
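
The arithmetic behind that figure can be sketched as follows. The respondent count is the "about 19 K" mentioned above; the world population of 7.7 billion is an assumption (a rough 2019 figure), not a number taken from the survey:

```python
survey_respondents = 19_000        # "about 19 K" participants, as stated above
world_population = 7_700_000_000   # assumption: rough world population in 2019

# Share of the world population that answered the survey
fraction = survey_respondents / world_population
print(f"fraction: {fraction:.7f}")
print(f"percent:  {fraction:.5%}")
```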


# 2. Other Text Responses

# 2.1 Do Data Scientists Produce Good-Quality Data?

#### 2.1.1 How to deal with errors and free text answers

![https://imgur.com/a/VT2K3om](https://i.imgur.com/QiZ2t6K.jpg)



Data drives most of the major decisions in the world today. The same was true in the past and will be true in the future. To understand how data has changed the history of humanity, we can look at some of the most important events in human history related to bad data:
* **Columbus’s Geographical Miscalculations** - Taken together, the two miscalculations effectively reduced the planetary waistline to 16,305 nautical miles, down from the actual 21,600 or so, an error of 25 percent. Source: [Columbus’s Geographical Miscalculations](https://spectrum.ieee.org/tech-talk/at-work/test-and-measurement/columbuss-geographical-miscalculations)
* **Mars Orbiter miscalculation** - A NASA review board found that the problem was in the software controlling the orbiter’s thrusters. The software calculated the force that the thrusters needed to exert in pounds of force. A second piece of code that read this data assumed it was in metric units (newtons). Source: [When NASA Lost a Spacecraft Due to a Metric Math Mistake](https://www.simscale.com/blog/2017/12/nasa-mars-climate-orbiter-metric/)
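
The 25 percent quoted for Columbus can be checked directly from the two circumference figures in that bullet. A small sketch (the variable names are only illustrative):

```python
columbus_estimate_nmi = 16_305  # circumference implied by the two miscalculations
actual_nmi = 21_600             # actual circumference quoted in the source ("or so")

# Relative shortfall of the estimate against the actual value
relative_error = (actual_nmi - columbus_estimate_nmi) / actual_nmi
print(f"relative error: {relative_error:.1%}")
```

The result is a little under 25 percent, which matches the rounded figure in the source.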


But at that time Data Science was less popular and fewer tools were available. What about 2019 and the community of data scientists? Does the Data Science community produce better-quality data now? Let's look at the free-text responses below:

In [None]:
# read data
df_other = pd.read_csv('/kaggle/input/kaggle-survey-2019/other_text_responses.csv')
df_other.shape

In [None]:
primary_tool = df_other['Q14_Part_3_TEXT'].value_counts()
plot_value_counts(primary_tool, 10, 'What is the primary tool that you use at work?', 'Number of Answers', 'Tool')

In the top 10 answers we can see: 
* Tableau       , tableau        , Tableau        , 
* Power BI       , PowerBI        , Power Bi        , Power bi        

There are many variations of a single answer, which makes it difficult to get accurate statistics. To find the real values we will try to group similar answers using `difflib`. This will help us count the values above as one single value.
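
Many of these variants differ only in letter case and whitespace, so a cheap normalization pass can merge them before any fuzzy matching. A minimal sketch, assuming a small hand-typed sample of the answers listed above (it is an optional pre-processing idea, not part of the original pipeline):

```python
import pandas as pd

# Variants copied from the list above, with and without stray whitespace.
answers = pd.Series(["Tableau", "tableau", "Tableau ",
                     "Power BI", "PowerBI", "Power Bi", "Power bi"])

# Lower-case and remove all whitespace so spelling variants share one key.
normalized = answers.str.lower().str.replace(r"\s+", "", regex=True)
print(normalized.value_counts())
```

Normalization alone cannot fix real typos such as "Tableu", which is where similarity matching becomes useful.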

For simplicity, let's go from the least frequent answers to the most frequent ones, so that more frequent spellings overwrite rarer ones. This part can be improved. This is how `difflib` works:

Searching all unique values for strings similar to **Tablequ** gives:
> ['Tablequ', 'Tableu', 'Tableau', 'Tablaeu', 'Tableau ', 'tableu', 'Tableau I', 'tableau', 'Tabluea', 'Tableai', 'Tableau SF', 'Tableau,SAS', 'Talend', 'Tableau, QLik', 'Tableau, Boxi', 'Qlik, Tableau']
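
A minimal, self-contained sketch of such a lookup with `difflib.get_close_matches` from the standard library. The candidate list is a small hand-picked subset of the answers shown in this section, not the full column:

```python
import difflib

# A few of the variants shown above, used as a small candidate pool.
candidates = ["Tableu", "Tableau", "Tablaeu", "tableau", "Talend", "Power BI", "PowerBI"]

# cutoff is the minimum similarity ratio (between 0 and 1);
# n caps how many matches are returned, best match first.
matches = difflib.get_close_matches("Tablequ", candidates, n=20, cutoff=0.6)
print(matches)
```

Raising `cutoff` makes the grouping stricter. At 0.6 an unrelated tool such as "Talend" can still slip into the Tableau group, as the output above shows.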

In [None]:
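import difflib

correct_values = {}
# Unique answers ordered from least to most frequent
words = df_other.Q14_Part_3_TEXT.value_counts(ascending=True).index

for keyword in words:
    # Up to 20 answers whose similarity ratio to the keyword is at least 0.6
    similar = difflib.get_close_matches(keyword, words, n=20, cutoff=0.6)
    for x in similar:
        # Later (more frequent) keywords overwrite earlier ones
        correct_values[x] = keyword

# Map every raw answer to its group label
df_other["corr"] = df_other["Q14_Part_3_TEXT"].map(correct_values)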

In [None]:
# Using similar values
correction = df_other["corr"].value_counts()
plot_value_counts(correction, 10, 'What is the primary tool that you use at work?', 'Number of Answers', 'Tool')

#### 2.1.2 How to investigate chaotic data

In [None]:
df_other.columns

In [None]:
# View sample data
df_other.head()

In [None]:
# list questions and column names
import pandas as pd
pd.set_option('display.max_colwidth', 1000)
df_other.head(1).T.head()

In [None]:
# List top values per question
for col in df_other.columns:
    print(col, end=' - ')
    print(df_other[col].iloc[0])  # the first row holds the question text
    display(df_other[col].value_counts().head(10))

# 3. Conclusion

> Where data flows, prosperity grows

My conclusion is that this new and interesting area of science is well represented by Kaggle. Most people who are unaware of the Data Science revolution will be surprised to find that the world is changing so fast. Many disruptive changes have arisen from Data Science and many more are expected.

In the 21st century, if you don't know Data Science you are **illiterate**.

> Times are different but problems and questions are similar.

![](https://i.imgur.com/y4wQMsG.png)

Good luck swimming in data!