# Thinking like a data scientist

As a data scientist, your ability to think scientifically will set you apart from the code-ninja-unicorns and number-crunchers who otherwise share many of your skills. In this course we'll work your scientific thinking muscles through drills, challenges, and capstone projects. Also, keep an eye out for "TLADS" or "Think Like a Data Scientist" asides scattered throughout the lessons – in those asides we'll show how to take the lesson's concept and implement it data-scientist-style.

Data scientists approach problems from three perspectives: curiosity, practicality, and skepticism.

Curiosity is something we're all born with. It's the engine behind "Why" and "How" questions, the desire to get behind the scenes, to know more than what we can see or understand right now. Curiosity leads to questions, and answering questions is data science's _raison d'être_, the reason the career exists.

Of course, not all questions can be answered with data, and some can't be answered at all. A skilled data scientist makes sure to define their questions in ways that are amenable to data-oriented solutions – this is where practicality comes in. "What is the meaning of life?" is not a question that can be answered with data. "What do people believe about the meaning of life?", on the other hand, is a practical question that falls squarely in the data scientist's realm.

Finally, a data scientist is skeptical. They not only describe what seems to be happening in a dataset but also how much confidence we can place in the results we see. The conclusions from a dataset will always be limited in various ways, from noisy data to unusual samples, and data scientists take those limitations into account when presenting their findings.

# Making real-life questions testable

As a rule of thumb, questions that can be answered with a number or numbers can be addressed by a data scientist. For example, 56% of people might say the meaning of life is "42", or analysis of Facebook statuses might find 1000 different discussions mentioning the meaning of life in the last 24 hours.

Even if a question doesn't seem like it has a numeric answer, it can often be rephrased in a way that makes it answerable with data. For example, "Why do bad things happen to good people?" is a compelling question that has been debated for thousands of years. A data scientist might contribute to the conversation by breaking the question down in the following ways:

1. Providing a concrete definition of "bad things," such as "being the victim of a crime."
2. Providing a concrete definition of "good people," such as "people who volunteer once a month or more."
3. Collecting data, or finding an existing dataset, containing data from people who were and were not victims of crimes, and do and do not volunteer once a month or more. This creates four quadrants, or buckets.

For example:

![Screen%20Shot%202019-02-27%20at%2011.30.20%20PM.png](attachment:Screen%20Shot%202019-02-27%20at%2011.30.20%20PM.png)

In this example, people in the quadrant who volunteer once a month or more and are victims of crime are more likely to be women than people in any other quadrant. In addition, they have a unique combination of low income and high likelihood of living in a city. So we might tentatively conclude that "bad things happen to good people" because these people are more likely to also be poorer women who live in a city, a combination of variables that increases their risk.
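As a rough sketch of how the four quadrants might be built, the snippet below uses `pandas.crosstab` on a tiny, made-up table. The column names (`volunteers_monthly`, `crime_victim`, `is_woman`) and every value in it are hypothetical, invented only to show the mechanics; they are not the data behind the figure above.

```python
import pandas as pd

# Hypothetical survey responses, invented purely for illustration.
people = pd.DataFrame({
    "volunteers_monthly": ["yes", "yes", "no", "no", "yes", "no", "yes", "no"],
    "crime_victim":       ["yes", "no", "no", "yes", "yes", "no", "no", "no"],
    "is_woman":           [True, False, True, False, True, False, True, False],
})

# Count how many people fall into each of the four quadrants.
quadrants = pd.crosstab(people["volunteers_monthly"], people["crime_victim"])
print(quadrants)

# Compare a characteristic across quadrants: the share of women in each one.
share_women = people.groupby(["volunteers_monthly", "crime_victim"])["is_woman"].mean()
print(share_women)
```

With real data each quadrant would hold far more than a handful of people, and the same `groupby` pattern could be repeated for income or city residence.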

Are there flaws in this approach? Of course! We don't know what is causing poorer women who live in a city to be more likely to volunteer and be victims of crime; we just know the characteristics are associated with each other. In addition, people might take issue with our variable definitions. They may argue that being a "good person" isn't (just) about volunteering, or that the category of "bad things" is much broader than just being victimized by crime. And they would be right. In translating the big question "Why do bad things happen to good people?" into something data science can tackle, we've lost a great deal of information and nuance. On the other hand, we've also learned something, perhaps laying the groundwork for a follow-up project. Lastly, note that we have to balance trying to answer as much as possible against having confidence in our conclusions. Learning to ask a question that is sufficiently broad to be relevant but precise enough to be solvable is another aspect of this same skill.

Sometimes the way a question is translated is dictated by the data available. If a company wants to learn about the mental health of their employees, but the only data they have is a question about job satisfaction within a larger employee survey, then they can either decide that "satisfied with my job" can stand in for "mentally healthy," or they can spend the money and time to collect new data.

Ultimately, translating questions that people care about into questions data science can answer is a skill that improves with practice.

# Finding and evaluating data sources

Data is everywhere. The trick is to know where to look to find data that is relevant to our research questions. Data sources range from archives with carefully curated databanks full of information to the flood of information poured out every day on Twitter, Facebook, and the rest of the web. Unless you're an experienced web-scraper, we recommend sourcing the data for course projects from data archives and repositories like these.

Not all datasets are created equal. We recommend datasets supplemented with metadata (data about data), including when and where the data was collected, what the population of interest was, the sampling technique used, and information about the individual variables. Metadata may be available as a webpage, an additional data file, or (depending on the file format used) information embedded within the main data file.

The dataset must also be available in a database or readable file format. Fortunately, the pandas package provides support for most widely used file formats, and you learned about accessing CSV, JSON, and XML files in the fundamentals course.
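As a quick refresher, here is a minimal sketch of loading the same small table from each of those three formats. The data is made up and read from in-memory strings so the example runs without any files on disk; with a real dataset you would pass a file path or URL instead. It assumes a pandas version that includes `read_xml` (1.3 or later) and uses `parser="etree"` so that no extra XML library is needed.

```python
from io import StringIO

import pandas as pd

# The same hypothetical two-row table, written out in three formats.
csv_text = "id,score\n1,3.5\n2,4.0\n"
json_text = '[{"id": 1, "score": 3.5}, {"id": 2, "score": 4.0}]'
xml_text = (
    "<rows>"
    "<row><id>1</id><score>3.5</score></row>"
    "<row><id>2</id><score>4.0</score></row>"
    "</rows>"
)

csv_df = pd.read_csv(StringIO(csv_text))
json_df = pd.read_json(StringIO(json_text))
# parser="etree" uses Python's built-in XML parser instead of lxml.
xml_df = pd.read_xml(StringIO(xml_text), parser="etree")

for name, df in (("CSV", csv_df), ("JSON", json_df), ("XML", xml_df)):
    print(name, df.shape, list(df.columns))
```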

# Evaluating uncertainty

It is easy to overstate the informative value of a statistic. Most people aren't used to looking at a number and wondering about the process that made it. One of the services a data scientist provides is to assess how certain we can be that conclusions based on a particular statistic are valid. Sources of uncertainty include the source of the sample, the size of the sample, and the amount of noise (variance) in the data.

Sampling a homogeneous population, such as college students, leads to uncertainty when generalizing to other populations, such as working adults.

As you saw earlier, smaller samples lead to larger standard errors and wider margins of error. An election poll might project a 3-point lead for candidate A with a margin of error of plus or minus 5 percentage points, meaning the true result could plausibly be anywhere from an 8-point lead to a 2-point deficit (losing to candidate B). A larger sample would shrink that margin of error.
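To see how sample size drives the margin of error, here is a minimal sketch using the common normal-approximation formula for a single sample proportion, z * sqrt(p * (1 - p) / n). The helper name `margin_of_error`, the 95% confidence multiplier `z=1.96`, and the sample sizes are assumptions chosen for illustration. Note that this is the margin for one candidate's share of the vote, which is not the same as the margin on the lead between two candidates.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p from n respondents.

    Uses the normal approximation; z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)


# p = 0.5 is the worst case: it gives the widest margin for a given n.
for n in (100, 400, 1600, 6400):
    moe = margin_of_error(0.5, n)
    print(f"n={n:>5}: +/- {100 * moe:.2f} percentage points")
```

Each time the sample size is quadrupled, the margin of error is cut in half, which is why very precise polls are expensive.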

Noisy data is more difficult to summarize. Two different variables can be described with a mean of 30, but if Variable 1 has a range of values from 25 to 40 and Variable 2 has a range of values from -30 to 90, then the mean of Variable 1 tells us a lot more about the datapoints in Variable 1 than the mean of Variable 2 does for the datapoints in Variable 2.
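The comparison above can be made concrete with a few lines of Python. The two lists below are invented to match the description (both have a mean of 30, one spans 25 to 40 and the other spans -30 to 90); the standard deviation is used here as one reasonable way to quantify the noise.

```python
import statistics

# Made-up values chosen so that both variables have a mean of exactly 30.
variable_1 = [25, 26, 27, 28, 29, 35, 40]   # ranges from 25 to 40
variable_2 = [-30, 0, 30, 60, 90]           # ranges from -30 to 90

for name, values in (("Variable 1", variable_1), ("Variable 2", variable_2)):
    mean = statistics.mean(values)
    spread = statistics.stdev(values)  # sample standard deviation
    print(f"{name}: mean={mean}, stdev={spread:.1f}, "
          f"range={min(values)} to {max(values)}")
```

The means are identical, but the much larger standard deviation of Variable 2 signals that its mean is a poor stand-in for any individual datapoint.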

# Translating statistical results into plain English

When reporting statistical results, it is easy to lapse into jargon, specialized language like "variance" or "bimodal" that exactly describes your results but that may be opaque to others. Keep in mind your intended audience, and always ground your findings in the question you're trying to answer.

For example, instead of saying "Group 1 had a higher mean but lower variance than Group 2, while Group 2 had some left-leaning skewness," try "Web customers spent more than app customers, but app customers differed more among one another in how much they spent, with some spending as little as $0.50 per visit."